0.1 Libraries Required

1 1. Introduction

This project investigates the relationship between county-level broadband connectivity and social vulnerability across the United States. We use:

This project follows a fully reproducible workflow using R Markdown, where all data cleaning, merging, analysis, and visualizations are generated inside this document. # 1.1 Data Inventory Table

  1. Social Vulnerability Index (SVI)

Dataset Name : CDC/ATSDR Social Vulnerability Index (SVI) Source / URL : https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html Year Used : 2020 Geographic Level: County (FIPS) Format: CSV Files Used: SVI2020_US_COUNTY.csv Size: ~5–6 MB Key Variables: County FIPS, overall SVI percentile, SVI Theme scores (Socioeconomic, Household Composition, Minority Status, Housing/Transportation) Data Quality Notes: Percentile values range 0–1; no SVI data for Puerto Rico; a few suppressed values for remote Alaska counties.

  1. FCC Broadband Deployment Data (December 2020) Dataset Name : FCC Form 477 Fixed Broadband Deployment Summary Source / URL : FCC Broadband Deployment Data Year Used: Dec 2020 Geographic Level: County Format: CSV Files Used: fcc_dec_2020_county.csv (cleaned version) Size: ~20–30 MB Key Variables: FIPS, availability of ≥25/3 Mbps broadband, provider counts, technology codes Data Quality Notes: Known issues include over-reporting availability; no coverage for Puerto Rico or territories.

  2. Microsoft Airband Broadband Usage

Dataset Name: Microsoft Airband “US Broadband Usage” Source / URL: https://github.com/microsoft/USBroadbandUsage Year Used: 2020 Geographic Level : County Format : CSV Files Used : airband_2020_county.csv Size: ~2 MB Key Variables : fips, measured broadband usage, observed availability Data Quality Notes : Usage values are decimals (0–1). No PR or territories.

  1. ACS Data — Computer & Internet Use (Table B28002)

Dataset Name : ACS 5-Year Estimates — B28002 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B28002-Data.csv Size : ~4 MB Key Variables : Total households, households with broadband, households with no internet access Data Quality Notes : Minor MOE columns removed; numeric conversion required.

  1. ACS Data — Median Household Income (Table B19013)

Dataset Name : ACS 5-Year Estimates — B19013 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B19013-Data.csv Size : ~3 MB Key Variables : Median household income Data Quality Notes : No PR values; income suppressed for some remote Alaska regions.

  1. ACS Data — Education Attainment (Table B15003)

Dataset Name : ACS 5-Year Estimates — B15003 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B15003-Data.csv Size : ~4 MB Key Variables : Education levels (HS, bachelor’s, master’s, doctorate) Data Quality Notes :Consistent coverage except PR and some small Alaska areas.

  1. Geographic Shapefiles (TIGER/Line 2020)

Dataset Name : TIGER/Line 2020 County Shapefile Source / URL : https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html Year Used : 2020 Geographic Level : County (MULTIPOLYGON) Format : ESRI Shapefile (.shp, .dbf, .shx, .prj) Files Used : tl_2020_us_county.shp + associated files Size : ~50–100 MB Key Variables : GEOID, county name, state FIPS, geometry Data Quality Notes : Includes 3,234 county-equivalents → includes Puerto Rico, territories, and Alaska census areas. Must be filtered before analysis.

2 2. FCC Broadband Data (2016–2024, focusing on 2020)

In this section, we work with the FCC Form 477 county-level broadband tier dataset.
The raw file contains semiannual observations (June and December) for each county from 2016–2024.

2.1 FCC Form 477 includes both deployment (availability) and subscription (adoption) datasets. The deployment dataset — used in this project — measures broadband availability across speed tiers (Tier 1–4) and provides a county-level snapshot of the infrastructure that is available to residents. This is distinct from the subscription dataset, which reflects how many households actually subscribe to broadband. Because our project examines disparities in access, we chosse the deployment dataset.

2.2 2.1 Load and Inspect FCC Raw Data

# Load the raw FCC county tier data file
fcc_raw <- read_csv(
  path_raw("fcc", "form477_fcc_data.csv")
)

# View column names and structure
names(fcc_raw)
##  [1] "Year"          "Month"         "FIPS"          "State"        
##  [5] "County"        "State_Name"    "County_Name"   "Housing_Units"
##  [9] "Tier_1"        "Tier_2"        "Tier_3"        "Tier_4"
glimpse(fcc_raw)
## Rows: 67,926
## Columns: 12
## $ Year          <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ Month         <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ FIPS          <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 10…
## $ State         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ County        <dbl> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 3…
## $ State_Name    <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
## $ County_Name   <chr> "Autauga County", "Baldwin County", "Barbour County", "B…
## $ Housing_Units <dbl> 25318, 135669, 11816, 9208, 25065, 4616, 9911, 53540, 16…
## $ Tier_1        <dbl> 5, 4, 3, 3, 4, 3, 4, 4, 4, 3, 4, 3, 3, 3, 3, 4, 4, 3, 4,…
## $ Tier_2        <dbl> 5, 4, 3, 3, 3, 3, 4, 4, 4, 3, 4, 2, 3, 3, 3, 4, 4, 2, 4,…
## $ Tier_3        <dbl> 5, 4, 3, 2, 3, 3, 3, 4, 4, 2, 3, 2, 2, 2, 2, 4, 4, 2, 4,…
## $ Tier_4        <dbl> 4, 3, 3, 2, 2, 3, 2, 3, 4, 1, 2, 1, 1, 2, 2, 4, 3, 1, 3,…

2.3 2.2 Clean FCC Data: Keep Only December 2020

fcc_2020_dec <- fcc_raw %>%
  rename_with(tolower) %>%
  filter(year == 2020, month == 12) %>%
  mutate(
    county_fips = str_pad(as.character(fips), 5, side = "left", pad = "0")
  ) %>%
  select(
    county_fips,
    state_name,
    county_name,
    housing_units,
    tier1 = tier_1,
    tier2 = tier_2,
    tier3 = tier_3
  )
head(fcc_2020_dec)
## # A tibble: 6 × 7
##   county_fips state_name county_name    housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                  <dbl> <dbl> <dbl> <dbl>
## 1 01001       Alabama    Autauga County         24165     4     4     4
## 2 01003       Alabama    Baldwin County        122518     4     3     3
## 3 01005       Alabama    Barbour County         12120     3     3     2
## 4 01007       Alabama    Bibb County             9340     3     2     1
## 5 01009       Alabama    Blount County          24625     3     3     2
## 6 01011       Alabama    Bullock County          4625     3     3     1
summary(fcc_2020_dec)
##  county_fips         state_name        county_name        housing_units    
##  Length:3234        Length:3234        Length:3234        Min.   :      0  
##  Class :character   Class :character   Class :character   1st Qu.:   5620  
##  Mode  :character   Mode  :character   Mode  :character   Median :  12697  
##                                                           Mean   :  44079  
##                                                           3rd Qu.:  31556  
##                                                           Max.   :3599420  
##      tier1           tier2           tier3      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000   Median :3.000   Median :3.000  
##  Mean   :3.681   Mean   :3.272   Mean   :2.717  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000

2.4 2.3 Broadband validation and consistency checks

# If not already in memory, load the non-spatial master dataset
master_2020 <- readRDS("processed_data/master_2020_county.rds")

broadband_check <- master_2020 %>%
  select(
    FIPS,
    housing_units,
    tier1, tier2, tier3,
    airband_fcc_availability,
    airband_usage
  )

# 1) Range checks for FCC/Airband measures
summary(broadband_check)
##      FIPS           housing_units         tier1           tier2     
##  Length:3141        Min.   :     56   Min.   :0.000   Min.   :0.00  
##  Class :character   1st Qu.:   5586   1st Qu.:3.000   1st Qu.:3.00  
##  Mode  :character   Median :  12666   Median :4.000   Median :3.00  
##                     Mean   :  44817   Mean   :3.722   Mean   :3.31  
##                     3rd Qu.:  31875   3rd Qu.:4.000   3rd Qu.:4.00  
##                     Max.   :3599420   Max.   :5.000   Max.   :5.00  
##                                                                     
##      tier3       airband_fcc_availability airband_usage   
##  Min.   :0.000   Min.   :0.0002           Min.   :0.0010  
##  1st Qu.:2.000   1st Qu.:0.7698           1st Qu.:0.1960  
##  Median :3.000   Median :0.9244           Median :0.3680  
##  Mean   :2.755   Mean   :0.8404           Mean   :0.3896  
##  3rd Qu.:4.000   3rd Qu.:0.9839           3rd Qu.:0.5660  
##  Max.   :5.000   Max.   :1.0000           Max.   :1.0000  
##                  NA's   :9                NA's   :11
# Check that availability and usage are between 0 and 1
invalid_range <- broadband_check %>%
  filter(
    airband_fcc_availability < 0 | airband_fcc_availability > 1 |
    airband_usage            < 0 | airband_usage            > 1
  )

invalid_range   # ideally 0 rows
## # A tibble: 0 × 7
## # ℹ 7 variables: FIPS <chr>, housing_units <dbl>, tier1 <dbl>, tier2 <dbl>,
## #   tier3 <dbl>, airband_fcc_availability <dbl>, airband_usage <dbl>
# Check that tiers (if present) are within 1–4
invalid_tiers <- broadband_check %>%
  filter(
    (!is.na(tier1) & (tier1 < 1 | tier1 > 4)) |
    (!is.na(tier2) & (tier2 < 1 | tier2 > 4)) |
    (!is.na(tier3) & (tier3 < 1 | tier3 > 4))
  )

invalid_tiers   # ideally 0 rows
## # A tibble: 490 × 7
##    FIPS  housing_units tier1 tier2 tier3 airband_fcc_availability airband_usage
##    <chr>         <dbl> <dbl> <dbl> <dbl>                    <dbl>         <dbl>
##  1 01051         34618     5     4     4                    0.927         0.509
##  2 01083         36869     5     4     4                    0.908         0.555
##  3 01089        169528     5     4     4                    0.964         0.747
##  4 01117         90258     5     5     4                    0.955         0.589
##  5 02013           755     4     3     0                   NA             0.066
##  6 02016          1970     5     4     0                   NA             0.023
##  7 02020        120299     5     4     4                    0.997         0.753
##  8 02060           985     2     1     0                   NA             0.054
##  9 02070          2457     2     1     0                   NA             0.052
## 10 02110         14027     5     4     4                    0.993         0.571
## # ℹ 480 more rows
# 2) Missingness in key broadband fields
colSums(is.na(broadband_check))
##                     FIPS            housing_units                    tier1 
##                        0                        0                        0 
##                    tier2                    tier3 airband_fcc_availability 
##                        0                        0                        9 
##            airband_usage 
##                       11
# 3) Simple consistency check: usage should not greatly exceed availability
usage_gt_avail <- broadband_check %>%
  filter(
    !is.na(airband_fcc_availability),
    !is.na(airband_usage),
    airband_usage > airband_fcc_availability
  )

usage_gt_avail %>% head()
## # A tibble: 6 × 7
##   FIPS  housing_units tier1 tier2 tier3 airband_fcc_availability airband_usage
##   <chr>         <dbl> <dbl> <dbl> <dbl>                    <dbl>         <dbl>
## 1 01063          5191     2     1     1                   0.0085         0.013
## 2 01105          4776     2     2     1                   0.0012         0.032
## 3 02185          2630     3     2     0                   0.0061         0.051
## 4 02188          2763     3     2     1                   0.0081         0.039
## 5 02198          3457     3     3     2                   0.0282         0.04 
## 6 02290          4078     2     1     1                   0.013          0.139
nrow(usage_gt_avail)
## [1] 50

FCC Data Validation Summary

After cleaning the FCC December 2020 county-level dataset, we verified the quality of all key variables before merging them into the master file. All FCC tiers (tier1, tier2, tier3) fall within the correct range of 1–4 with no invalid values detected. Housing-unit counts are strictly positive for all counties, and no negative, zero, or corrupted values were found. The FCC dataset therefore passes internal consistency checks and is suitable for downstream analysis.

2.5 2.4 Save Cleaned FCC Dataset

write_csv(fcc_2020_dec, path_processed("fcc_2020_dec_clean.csv"))
saveRDS(fcc_2020_dec, path_processed("fcc_2020_dec_clean.rds"))

3 3. Social Vulnerability Index (SVI) Data – 2020 (County Level)

The CDC/ATSDR Social Vulnerability Index (SVI) provides an overall vulnerability percentile and four theme-level indices for each U.S. county. Here we load the 2020 county-level SVI dataset.

3.1 3.1 Load and Inspect SVI Raw Data

svi_raw <- read_csv(
  path_raw("svi", "SVI_2020_US_county.csv")
)

names(svi_raw)
##   [1] "ST"           "STATE"        "ST_ABBR"      "STCNTY"       "COUNTY"      
##   [6] "FIPS"         "LOCATION"     "AREA_SQMI"    "E_TOTPOP"     "M_TOTPOP"    
##  [11] "E_HU"         "M_HU"         "E_HH"         "M_HH"         "E_POV150"    
##  [16] "M_POV150"     "E_UNEMP"      "M_UNEMP"      "E_HBURD"      "M_HBURD"     
##  [21] "E_NOHSDP"     "M_NOHSDP"     "E_UNINSUR"    "M_UNINSUR"    "E_AGE65"     
##  [26] "M_AGE65"      "E_AGE17"      "M_AGE17"      "E_DISABL"     "M_DISABL"    
##  [31] "E_SNGPNT"     "M_SNGPNT"     "E_LIMENG"     "M_LIMENG"     "E_MINRTY"    
##  [36] "M_MINRTY"     "E_MUNIT"      "M_MUNIT"      "E_MOBILE"     "M_MOBILE"    
##  [41] "E_CROWD"      "M_CROWD"      "E_NOVEH"      "M_NOVEH"      "E_GROUPQ"    
##  [46] "M_GROUPQ"     "EP_POV150"    "MP_POV150"    "EP_UNEMP"     "MP_UNEMP"    
##  [51] "EP_HBURD"     "MP_HBURD"     "EP_NOHSDP"    "MP_NOHSDP"    "EP_UNINSUR"  
##  [56] "MP_UNINSUR"   "EP_AGE65"     "MP_AGE65"     "EP_AGE17"     "MP_AGE17"    
##  [61] "EP_DISABL"    "MP_DISABL"    "EP_SNGPNT"    "MP_SNGPNT"    "EP_LIMENG"   
##  [66] "MP_LIMENG"    "EP_MINRTY"    "MP_MINRTY"    "EP_MUNIT"     "MP_MUNIT"    
##  [71] "EP_MOBILE"    "MP_MOBILE"    "EP_CROWD"     "MP_CROWD"     "EP_NOVEH"    
##  [76] "MP_NOVEH"     "EP_GROUPQ"    "MP_GROUPQ"    "EPL_POV150"   "EPL_UNEMP"   
##  [81] "EPL_HBURD"    "EPL_NOHSDP"   "EPL_UNINSUR"  "SPL_THEME1"   "RPL_THEME1"  
##  [86] "EPL_AGE65"    "EPL_AGE17"    "EPL_DISABL"   "EPL_SNGPNT"   "EPL_LIMENG"  
##  [91] "SPL_THEME2"   "RPL_THEME2"   "EPL_MINRTY"   "SPL_THEME3"   "RPL_THEME3"  
##  [96] "EPL_MUNIT"    "EPL_MOBILE"   "EPL_CROWD"    "EPL_NOVEH"    "EPL_GROUPQ"  
## [101] "SPL_THEME4"   "RPL_THEME4"   "SPL_THEMES"   "RPL_THEMES"   "F_POV150"    
## [106] "F_UNEMP"      "F_HBURD"      "F_NOHSDP"     "F_UNINSUR"    "F_THEME1"    
## [111] "F_AGE65"      "F_AGE17"      "F_DISABL"     "F_SNGPNT"     "F_LIMENG"    
## [116] "F_THEME2"     "F_MINRTY"     "F_THEME3"     "F_MUNIT"      "F_MOBILE"    
## [121] "F_CROWD"      "F_NOVEH"      "F_GROUPQ"     "F_THEME4"     "F_TOTAL"     
## [126] "E_DAYPOP"     "E_NOINT"      "M_NOINT"      "E_AFAM"       "M_AFAM"      
## [131] "E_HISP"       "M_HISP"       "E_ASIAN"      "M_ASIAN"      "E_AIAN"      
## [136] "M_AIAN"       "E_NHPI"       "M_NHPI"       "E_TWOMORE"    "M_TWOMORE"   
## [141] "E_OTHERRACE"  "M_OTHERRACE"  "EP_NOINT"     "MP_NOINT"     "EP_AFAM"     
## [146] "MP_AFAM"      "EP_HISP"      "MP_HISP"      "EP_ASIAN"     "MP_ASIAN"    
## [151] "EP_AIAN"      "MP_AIAN"      "EP_NHPI"      "MP_NHPI"      "EP_TWOMORE"  
## [156] "MP_TWOMORE"   "EP_OTHERRACE" "MP_OTHERRACE"
glimpse(svi_raw)
## Rows: 3,143
## Columns: 158
## $ ST           <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01…
## $ STATE        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
## $ ST_ABBR      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ STCNTY       <chr> "01001", "01003", "01005", "01007", "01009", "01011", "01…
## $ COUNTY       <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bullo…
## $ FIPS         <chr> "01001", "01003", "01005", "01007", "01009", "01011", "01…
## $ LOCATION     <chr> "Autauga County, Alabama", "Baldwin County, Alabama", "Ba…
## $ AREA_SQMI    <dbl> 594.4558, 1589.8353, 885.0076, 622.4693, 644.8904, 622.81…
## $ E_TOTPOP     <dbl> 55639, 218289, 25026, 22374, 57755, 10173, 19726, 114324,…
## $ M_TOTPOP     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ E_HU         <dbl> 23697, 116747, 12057, 9237, 24404, 4591, 10119, 53722, 16…
## $ M_HU         <dbl> 68, 180, 119, 82, 87, 60, 57, 190, 70, 68, 60, 42, 63, 33…
## $ E_HH         <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, 44572, 13582…
## $ M_HH         <dbl> 366, 1143, 338, 299, 430, 195, 282, 642, 418, 476, 358, 2…
## $ E_POV150     <dbl> 12611, 36413, 8965, 5730, 13624, 4128, 6133, 30323, 9277,…
## $ M_POV150     <dbl> 1349, 2802, 673, 907, 1151, 593, 794, 2125, 926, 842, 121…
## $ E_UNEMP      <dbl> 736, 4027, 649, 667, 1253, 139, 551, 3769, 548, 444, 1153…
## $ M_UNEMP      <dbl> 185, 714, 182, 260, 301, 107, 194, 560, 195, 170, 324, 15…
## $ E_HBURD      <dbl> 5029, 19350, 2305, 1580, 4060, 695, 1383, 9451, 2932, 196…
## $ M_HBURD      <dbl> 576, 1526, 298, 319, 431, 180, 218, 766, 368, 347, 482, 1…
## $ E_NOHSDP     <dbl> 4273, 14823, 4497, 3056, 6838, 1700, 1886, 11787, 4140, 3…
## $ M_NOHSDP     <dbl> 562, 1344, 387, 414, 499, 285, 286, 887, 488, 499, 686, 2…
## $ E_UNINSUR    <dbl> 4345, 20501, 2362, 1878, 5746, 1025, 1973, 11260, 3045, 2…
## $ M_UNINSUR    <dbl> 725, 1932, 311, 387, 695, 290, 380, 970, 454, 429, 804, 3…
## $ E_AGE65      <dbl> 8490, 44716, 4777, 3676, 10382, 1638, 4092, 20245, 6530, …
## $ M_AGE65      <dbl> 85, 109, 16, 125, 106, 19, 98, 127, 45, 67, 69, 0, 11, 58…
## $ E_AGE17      <dbl> 13143, 46993, 5222, 4584, 13372, 2279, 4401, 24769, 6946,…
## $ M_AGE17      <dbl> 49, 0, 29, 0, 85, 147, 146, 0, 68, 73, 27, 0, 27, 0, 35, …
## $ E_DISABL     <dbl> 9658, 30615, 4159, 3748, 8564, 1203, 3067, 22943, 6113, 4…
## $ M_DISABL     <dbl> 856, 1565, 287, 551, 775, 291, 317, 1157, 518, 522, 736, …
## $ E_SNGPNT     <dbl> 1608, 3317, 1029, 790, 1313, 499, 500, 2972, 895, 385, 10…
## $ M_SNGPNT     <dbl> 302, 547, 217, 252, 262, 200, 152, 387, 242, 130, 309, 12…
## $ E_LIMENG     <dbl> 363, 1593, 433, 75, 801, 150, 59, 754, 111, 275, 719, 25,…
## $ M_LIMENG     <dbl> 229, 486, 205, 107, 263, 114, 88, 226, 110, 181, 328, 81,…
## $ E_MINRTY     <dbl> 14479, 37334, 13694, 5724, 7690, 8011, 9601, 32275, 15014…
## $ M_MINRTY     <dbl> 566, 1363, 266, 158, 343, 199, 140, 645, 216, 251, 620, 1…
## $ E_MUNIT      <dbl> 918, 19513, 170, 228, 167, 73, 114, 1977, 1432, 250, 184,…
## $ M_MUNIT      <dbl> 292, 979, 77, 112, 95, 79, 49, 299, 308, 105, 123, 31, 96…
## $ E_MOBILE     <dbl> 4313, 11893, 3644, 2943, 6043, 1919, 2349, 8364, 2281, 42…
## $ M_MOBILE     <dbl> 440, 868, 347, 327, 515, 271, 272, 657, 335, 390, 551, 31…
## $ E_CROWD      <dbl> 339, 1280, 350, 91, 385, 33, 72, 547, 974, 151, 455, 52, …
## $ M_CROWD      <dbl> 162, 364, 192, 58, 139, 34, 44, 147, 332, 77, 174, 46, 83…
## $ E_NOVEH      <dbl> 1167, 2627, 1039, 481, 1077, 418, 359, 2270, 1000, 343, 6…
## $ M_NOVEH      <dbl> 320, 397, 202, 226, 275, 192, 97, 317, 243, 110, 212, 126…
## $ E_GROUPQ     <dbl> 578, 2954, 2910, 1657, 564, 433, 320, 2979, 558, 274, 429…
## $ M_GROUPQ     <dbl> 157, 403, 264, 221, 144, 163, 139, 461, 163, 145, 134, 61…
## $ EP_POV150    <dbl> 22.9, 16.9, 40.6, 27.6, 23.8, 42.5, 31.6, 27.3, 28.2, 25.…
## $ MP_POV150    <dbl> 2.4, 1.3, 3.0, 4.4, 2.0, 6.0, 4.1, 1.9, 2.8, 3.3, 2.8, 3.…
## $ EP_UNEMP     <dbl> 2.9, 3.9, 6.9, 7.4, 5.2, 3.4, 6.5, 7.3, 3.6, 4.1, 6.1, 6.…
## $ MP_UNEMP     <dbl> 0.7, 0.7, 1.9, 2.8, 1.2, 2.5, 2.3, 1.1, 1.2, 1.6, 1.7, 3.…
## $ EP_HBURD     <dbl> 23.3, 23.0, 24.7, 21.8, 19.1, 20.3, 20.8, 21.2, 21.6, 18.…
## $ MP_HBURD     <dbl> 2.6, 1.8, 3.1, 4.3, 2.0, 5.1, 3.2, 1.7, 2.6, 3.1, 2.8, 3.…
## $ EP_NOHSDP    <dbl> 11.3, 9.5, 25.3, 19.1, 17.2, 25.1, 13.6, 14.9, 17.4, 17.2…
## $ MP_NOHSDP    <dbl> 1.5, 0.9, 2.2, 2.6, 1.3, 4.1, 2.1, 1.1, 2.0, 2.6, 2.3, 2.…
## $ EP_UNINSUR   <dbl> 8.0, 9.5, 10.7, 9.1, 10.0, 10.5, 10.1, 10.0, 9.2, 9.5, 11…
## $ MP_UNINSUR   <dbl> 1.3, 0.9, 1.4, 1.9, 1.2, 3.0, 1.9, 0.9, 1.4, 1.7, 1.8, 2.…
## $ EP_AGE65     <dbl> 15.3, 20.5, 19.1, 16.4, 18.0, 16.1, 20.7, 17.7, 19.5, 22.…
## $ MP_AGE65     <dbl> 0.2, 0.1, 0.1, 0.6, 0.2, 0.3, 0.5, 0.1, 0.1, 0.3, 0.2, 0.…
## $ EP_AGE17     <dbl> 23.6, 21.5, 20.9, 20.5, 23.2, 22.4, 22.3, 21.7, 20.8, 19.…
## $ MP_AGE17     <dbl> 0.1, 0.0, 0.1, 0.0, 0.1, 1.4, 0.7, 0.0, 0.2, 0.3, 0.1, 0.…
## $ EP_DISABL    <dbl> 17.7, 14.2, 18.8, 18.1, 15.0, 12.3, 15.8, 20.4, 18.5, 16.…
## $ MP_DISABL    <dbl> 1.6, 0.7, 1.4, 2.6, 1.4, 3.0, 1.6, 1.0, 1.6, 2.0, 1.7, 2.…
## $ EP_SNGPNT    <dbl> 7.5, 3.9, 11.0, 10.9, 6.2, 14.6, 7.5, 6.7, 6.6, 3.6, 6.4,…
## $ MP_SNGPNT    <dbl> 1.4, 0.6, 2.3, 3.4, 1.2, 5.8, 2.3, 0.9, 1.8, 1.2, 1.8, 2.…
## $ EP_LIMENG    <dbl> 0.7, 0.8, 1.8, 0.4, 1.5, 1.6, 0.3, 0.7, 0.4, 1.1, 1.7, 0.…
## $ MP_LIMENG    <dbl> 0.4, 0.2, 0.9, 0.5, 0.5, 1.2, 0.5, 0.2, 0.3, 0.7, 0.8, 0.…
## $ EP_MINRTY    <dbl> 26.0, 17.1, 54.7, 25.6, 13.3, 78.7, 48.7, 28.2, 44.9, 8.7…
## $ MP_MINRTY    <dbl> 1.0, 0.6, 1.1, 0.7, 0.6, 2.0, 0.7, 0.6, 0.6, 1.0, 1.4, 0.…
## $ EP_MUNIT     <dbl> 3.9, 16.7, 1.4, 2.5, 0.7, 1.6, 1.1, 3.7, 8.4, 1.5, 0.9, 0…
## $ MP_MUNIT     <dbl> 1.2, 0.8, 0.6, 1.2, 0.4, 1.7, 0.5, 0.6, 1.8, 0.6, 0.6, 0.…
## $ EP_MOBILE    <dbl> 18.2, 10.2, 30.2, 31.9, 24.8, 41.8, 23.2, 15.6, 13.4, 25.…
## $ MP_MOBILE    <dbl> 1.8, 0.7, 2.8, 3.5, 2.1, 5.9, 2.7, 1.2, 2.0, 2.3, 2.8, 4.…
## $ EP_CROWD     <dbl> 1.6, 1.5, 3.8, 1.3, 1.8, 1.0, 1.1, 1.2, 7.2, 1.4, 2.7, 1.…
## $ MP_CROWD     <dbl> 0.8, 0.4, 2.1, 0.8, 0.7, 1.0, 0.7, 0.3, 2.4, 0.7, 1.0, 0.…
## $ EP_NOVEH     <dbl> 5.4, 3.1, 11.1, 6.6, 5.1, 12.2, 5.4, 5.1, 7.4, 3.2, 3.9, …
## $ MP_NOVEH     <dbl> 1.5, 0.5, 2.1, 3.1, 1.3, 5.5, 1.4, 0.7, 1.8, 1.0, 1.2, 2.…
## $ EP_GROUPQ    <dbl> 1.0, 1.4, 11.6, 7.4, 1.0, 4.3, 1.6, 2.6, 1.7, 1.1, 1.0, 1…
## $ MP_GROUPQ    <dbl> 0.3, 0.2, 1.1, 1.0, 0.2, 1.6, 0.7, 0.4, 0.5, 0.6, 0.3, 0.…
## $ EPL_POV150   <dbl> 0.4624, 0.1770, 0.9586, 0.6820, 0.5010, 0.9694, 0.8119, 0…
## $ EPL_UNEMP    <dbl> 0.1397, 0.2861, 0.8071, 0.8485, 0.5586, 0.2005, 0.7689, 0…
## $ EPL_HBURD    <dbl> 0.6038, 0.5808, 0.7075, 0.4831, 0.2651, 0.3628, 0.4055, 0…
## $ EPL_NOHSDP   <dbl> 0.5080, 0.3647, 0.9688, 0.8654, 0.8027, 0.9666, 0.6464, 0…
## $ EPL_UNINSUR  <dbl> 0.4586, 0.5767, 0.6680, 0.5471, 0.6181, 0.6563, 0.6254, 0…
## $ SPL_THEME1   <dbl> 2.1725, 1.9853, 4.1100, 3.4261, 2.7455, 3.1556, 3.2581, 3…
## $ RPL_THEME1   <dbl> 0.3838, 0.3253, 0.9567, 0.8008, 0.5757, 0.7183, 0.7486, 0…
## $ EPL_AGE65    <dbl> 0.1827, 0.6496, 0.5216, 0.2623, 0.4090, 0.2400, 0.6728, 0…
## $ EPL_AGE17    <dbl> 0.6986, 0.4077, 0.3316, 0.2903, 0.6489, 0.5369, 0.5220, 0…
## $ EPL_DISABL   <dbl> 0.6830, 0.3689, 0.7581, 0.7101, 0.4421, 0.2011, 0.5191, 0…
## $ EPL_SNGPNT   <dbl> 0.7922, 0.1738, 0.9656, 0.9637, 0.6108, 0.9949, 0.7922, 0…
## $ EPL_LIMENG   <dbl> 0.4736, 0.5150, 0.7460, 0.3116, 0.6986, 0.7139, 0.2416, 0…
## $ SPL_THEME2   <dbl> 2.8301, 2.1150, 3.3229, 2.5380, 2.8094, 2.6868, 2.7477, 2…
## $ RPL_THEME2   <dbl> 0.7362, 0.2724, 0.9453, 0.5512, 0.7225, 0.6515, 0.6887, 0…
## $ EPL_MINRTY   <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.9809, 0.8686, 0…
## $ SPL_THEME3   <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.9809, 0.8686, 0…
## $ RPL_THEME3   <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.9809, 0.8686, 0…
## $ EPL_MUNIT    <dbl> 0.6050, 0.9574, 0.2565, 0.4402, 0.1187, 0.2909, 0.1983, 0…
## $ EPL_MOBILE   <dbl> 0.7486, 0.4885, 0.9427, 0.9554, 0.8803, 0.9914, 0.8571, 0…
## $ EPL_CROWD    <dbl> 0.4023, 0.3600, 0.8549, 0.2759, 0.4742, 0.1680, 0.1996, 0…
## $ EPL_NOVEH    <dbl> 0.4764, 0.1257, 0.9456, 0.6556, 0.4281, 0.9609, 0.4764, 0…
## $ EPL_GROUPQ   <dbl> 0.1569, 0.3157, 0.9513, 0.8873, 0.1569, 0.7766, 0.3873, 0…
## $ SPL_THEME4   <dbl> 2.3892, 2.2473, 3.9510, 3.2144, 2.0582, 3.1878, 2.1187, 2…
## $ RPL_THEME4   <dbl> 0.4309, 0.3612, 0.9949, 0.8622, 0.2743, 0.8501, 0.2982, 0…
## $ SPL_THEMES   <dbl> 8.0255, 6.8498, 12.2801, 9.8077, 8.0278, 10.0111, 8.9931,…
## $ RPL_THEMES   <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.8310, 0.6843, 0…
## $ F_POV150     <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_UNEMP      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ F_HBURD      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_NOHSDP     <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_UNINSUR    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_THEME1     <dbl> 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ F_AGE65      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_AGE17      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_DISABL     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, …
## $ F_SNGPNT     <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ F_LIMENG     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_THEME2     <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, …
## $ F_MINRTY     <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_THEME3     <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_MUNIT      <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_MOBILE     <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, …
## $ F_CROWD      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_NOVEH      <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_GROUPQ     <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_THEME4     <dbl> 0, 1, 3, 1, 0, 2, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, …
## $ F_TOTAL      <dbl> 0, 1, 6, 2, 0, 6, 0, 0, 1, 0, 1, 2, 2, 1, 1, 0, 0, 2, 2, …
## $ E_DAYPOP     <dbl> 41810, 218607, 27133, 18799, 42172, 7990, 18138, 118672, …
## $ E_NOINT      <dbl> 7100, 24453, 6249, 3839, 8987, 3174, 4683, 18425, 6618, 4…
## $ M_NOINT      <dbl> 913, 2477, 675, 626, 1175, 700, 645, 1731, 745, 589, 844,…
## $ E_AFAM       <dbl> 10849, 19027, 11889, 4971, 771, 6980, 8773, 23669, 13372,…
## $ M_AFAM       <dbl> 345, 744, 179, 111, 167, 140, 76, 380, 127, 120, 420, 79,…
## $ E_HISP       <dbl> 1601, 9947, 1110, 600, 5362, 824, 290, 4406, 843, 432, 34…
## $ M_HISP       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 63, 124, 0, 0, 0, 0, 99,…
## $ E_ASIAN      <dbl> 649, 2033, 122, 56, 236, 137, 261, 919, 368, 26, 195, 25,…
## $ M_ASIAN      <dbl> 174, 352, 25, 81, 41, 117, 41, 186, 50, 35, 55, 32, 27, 3…
## $ E_AIAN       <dbl> 155, 1327, 81, 12, 49, 0, 66, 168, 81, 150, 11, 7, 18, 0,…
## $ M_AIAN       <dbl> 102, 371, 63, 25, 41, 19, 52, 66, 44, 124, 36, 10, 27, 19…
## $ E_NHPI       <dbl> 0, 10, 1, 0, 55, 0, 0, 327, 9, 0, 0, 0, 0, 0, 0, 0, 57, 0…
## $ M_NHPI       <dbl> 29, 16, 2, 23, 60, 19, 19, 135, 16, 23, 26, 19, 23, 19, 1…
## $ E_TWOMORE    <dbl> 1124, 4250, 334, 85, 1038, 68, 211, 2610, 267, 398, 797, …
## $ M_TWOMORE    <dbl> 374, 871, 149, 67, 181, 76, 94, 414, 139, 176, 411, 53, 1…
## $ E_OTHERRACE  <dbl> 101, 740, 157, 0, 179, 2, 0, 176, 74, 0, 170, 0, 47, 5, 1…
## $ M_OTHERRACE  <dbl> 145, 534, 110, 23, 225, 4, 19, 209, 83, 23, 187, 19, 67, …
## $ EP_NOINT     <dbl> 12.9, 11.4, 28.3, 18.5, 15.7, 32.6, 24.1, 16.5, 20.1, 16.…
## $ MP_NOINT     <dbl> 0.0, 0.0, 0.3, 0.2, 0.0, 0.6, 0.2, 0.1, 0.1, 0.1, 0.1, 0.…
## $ EP_AFAM      <dbl> 19.5, 8.7, 47.5, 22.2, 1.3, 68.6, 44.5, 20.7, 40.0, 4.8, …
## $ MP_AFAM      <dbl> 0.6, 0.3, 0.7, 0.5, 0.3, 1.4, 0.4, 0.3, 0.4, 0.5, 1.0, 0.…
## $ EP_HISP      <dbl> 2.9, 4.6, 4.4, 2.7, 9.3, 8.1, 1.5, 3.9, 2.5, 1.7, 7.8, 0.…
## $ MP_HISP      <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.…
## $ EP_ASIAN     <dbl> 1.2, 0.9, 0.5, 0.3, 0.4, 1.3, 1.3, 0.8, 1.1, 0.1, 0.4, 0.…
## $ MP_ASIAN     <dbl> 0.3, 0.2, 0.1, 0.4, 0.1, 1.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.…
## $ EP_AIAN      <dbl> 0.3, 0.6, 0.3, 0.1, 0.1, 0.0, 0.3, 0.1, 0.2, 0.6, 0.0, 0.…
## $ MP_AIAN      <dbl> 0.2, 0.2, 0.3, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.…
## $ EP_NHPI      <dbl> 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.…
## $ MP_NHPI      <dbl> 0.1, 0.1, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.…
## $ EP_TWOMORE   <dbl> 2.0, 1.9, 1.3, 0.4, 1.8, 0.7, 1.1, 2.3, 0.8, 1.5, 1.8, 0.…
## $ MP_TWOMORE   <dbl> 0.7, 0.4, 0.6, 0.3, 0.3, 0.7, 0.5, 0.4, 0.4, 0.7, 0.9, 0.…
## $ EP_OTHERRACE <dbl> 0.2, 0.3, 0.6, 0.0, 0.3, 0.0, 0.0, 0.2, 0.2, 0.0, 0.4, 0.…
## $ MP_OTHERRACE <dbl> 0.3, 0.2, 0.4, 0.2, 0.4, 0.1, 0.2, 0.2, 0.2, 0.1, 0.4, 0.…

3.2 3.2 Clean SVI Data (Create county_fips, select key variables)

svi_county <- svi_raw %>%
  # standardize column names
  rename_with(tolower) %>%
  
  # create standardized 5-digit county FIPS (same format as FCC)
  mutate(
    county_fips = str_pad(as.character(fips), 5, "left", "0")
  ) %>%
  
  # keep only the variables needed for analysis
  select(
    county_fips,
    state = st_abbr,
    county,
    svi_overall = rpl_themes,
    svi_soc     = rpl_theme1,
    svi_hh      = rpl_theme2,
    svi_min     = rpl_theme3,
    svi_hous    = rpl_theme4
  )

head(svi_county)
## # A tibble: 6 × 8
##   county_fips state county  svi_overall svi_soc svi_hh svi_min svi_hous
##   <chr>       <chr> <chr>         <dbl>   <dbl>  <dbl>   <dbl>    <dbl>
## 1 01001       AL    Autauga       0.513   0.384  0.736   0.634    0.431
## 2 01003       AL    Baldwin       0.310   0.325  0.272   0.502    0.361
## 3 01005       AL    Barbour       0.993   0.957  0.945   0.896    0.995
## 4 01007       AL    Bibb          0.808   0.801  0.551   0.629    0.862
## 5 01009       AL    Blount        0.514   0.576  0.722   0.415    0.274
## 6 01011       AL    Bullock       0.831   0.718  0.652   0.981    0.850
summary(svi_county)
##  county_fips           state              county           svi_overall    
##  Length:3143        Length:3143        Length:3143        Min.   :0.0000  
##  Class :character   Class :character   Class :character   1st Qu.:0.2500  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.4997  
##                                                           Mean   :0.5000  
##                                                           3rd Qu.:0.7500  
##                                                           Max.   :1.0000  
##     svi_soc         svi_hh          svi_min          svi_hous     
##  Min.   :0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.25   1st Qu.:0.2500   1st Qu.:0.2454   1st Qu.:0.2500  
##  Median :0.50   Median :0.4997   Median :0.4981   Median :0.5000  
##  Mean   :0.50   Mean   :0.5000   Mean   :0.4988   Mean   :0.5000  
##  3rd Qu.:0.75   3rd Qu.:0.7500   3rd Qu.:0.7495   3rd Qu.:0.7499  
##  Max.   :1.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

3.3 3.3 Save Cleaned SVI Dataset

write_csv(svi_county, path_processed("svi_2020_county_clean.csv"))
saveRDS(svi_county, path_processed("svi_2020_county_clean.rds"))

3.4 3.4 SVI Validation and Consistency Checks

After cleaning the SVI data and keeping only the overall index and four theme scores, we validated the SVI variables to ensure they are suitable for analysis. We checked that all SVI values fall within the expected 0–1 percentile range, identified any missing values, and examined correlations between the overall SVI score and each theme.

library(dplyr)

# If not already in memory, load the cleaned SVI
svi_2020_county_clean <- readRDS("processed_data/svi_2020_county_clean.rds")

# Keep only SVI numeric columns for checks
svi_check <- svi_2020_county_clean %>%
  select(
    county_fips,
    svi_overall,
    svi_soc,
    svi_hh,
    svi_min,
    svi_hous
  )

# 1) Range validation: SVI percentiles should be between 0 and 1
summary(svi_check)
##  county_fips         svi_overall        svi_soc         svi_hh      
##  Length:3143        Min.   :0.0000   Min.   :0.00   Min.   :0.0000  
##  Class :character   1st Qu.:0.2500   1st Qu.:0.25   1st Qu.:0.2500  
##  Mode  :character   Median :0.4997   Median :0.50   Median :0.4997  
##                     Mean   :0.5000   Mean   :0.50   Mean   :0.5000  
##                     3rd Qu.:0.7500   3rd Qu.:0.75   3rd Qu.:0.7500  
##                     Max.   :1.0000   Max.   :1.00   Max.   :1.0000  
##     svi_min          svi_hous     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2454   1st Qu.:0.2500  
##  Median :0.4981   Median :0.5000  
##  Mean   :0.4988   Mean   :0.5000  
##  3rd Qu.:0.7495   3rd Qu.:0.7499  
##  Max.   :1.0000   Max.   :1.0000
invalid_svi <- svi_check %>%
  filter(
    svi_overall < 0 | svi_overall > 1 |
    svi_soc     < 0 | svi_soc     > 1 |
    svi_hh      < 0 | svi_hh      > 1 |
    svi_min     < 0 | svi_min     > 1 |
    svi_hous    < 0 | svi_hous    > 1
  )

invalid_svi   # ideally this is an empty tibble
## # A tibble: 0 × 6
## # ℹ 6 variables: county_fips <chr>, svi_overall <dbl>, svi_soc <dbl>,
## #   svi_hh <dbl>, svi_min <dbl>, svi_hous <dbl>
# 2) Missing values check for SVI columns
colSums(is.na(svi_check))
## county_fips svi_overall     svi_soc      svi_hh     svi_min    svi_hous 
##           0           0           0           0           0           0

4 4. Merge FCC and SVI Data

In this section, we combine the cleaned FCC broadband dataset (December 2020) with the cleaned SVI 2020 county dataset using 5-digit county FIPS codes.

4.1 4.1 Load Cleaned FCC and SVI Datasets

fcc_2020 <- readRDS(path_processed("fcc_2020_dec_clean.rds"))
svi_2020 <- readRDS(path_processed("svi_2020_county_clean.rds"))

head(fcc_2020)
## # A tibble: 6 × 7
##   county_fips state_name county_name    housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                  <dbl> <dbl> <dbl> <dbl>
## 1 01001       Alabama    Autauga County         24165     4     4     4
## 2 01003       Alabama    Baldwin County        122518     4     3     3
## 3 01005       Alabama    Barbour County         12120     3     3     2
## 4 01007       Alabama    Bibb County             9340     3     2     1
## 5 01009       Alabama    Blount County          24625     3     3     2
## 6 01011       Alabama    Bullock County          4625     3     3     1
head(svi_2020)
## # A tibble: 6 × 8
##   county_fips state county  svi_overall svi_soc svi_hh svi_min svi_hous
##   <chr>       <chr> <chr>         <dbl>   <dbl>  <dbl>   <dbl>    <dbl>
## 1 01001       AL    Autauga       0.513   0.384  0.736   0.634    0.431
## 2 01003       AL    Baldwin       0.310   0.325  0.272   0.502    0.361
## 3 01005       AL    Barbour       0.993   0.957  0.945   0.896    0.995
## 4 01007       AL    Bibb          0.808   0.801  0.551   0.629    0.862
## 5 01009       AL    Blount        0.514   0.576  0.722   0.415    0.274
## 6 01011       AL    Bullock       0.831   0.718  0.652   0.981    0.850

4.2 4.2 Identify FCC Counties Without Matching SVI Records

missing_svi <- fcc_2020 %>%
  anti_join(svi_2020, by = "county_fips")

head(missing_svi)
## # A tibble: 6 × 7
##   county_fips state_name     county_name      housing_units tier1 tier2 tier3
##   <chr>       <chr>          <chr>                    <dbl> <dbl> <dbl> <dbl>
## 1 60010       American Samoa Eastern District          4490     3     3     3
## 2 60020       American Samoa Manu'a District            376     3     3     3
## 3 60030       American Samoa Rose Island                  0     0     0     0
## 4 60040       American Samoa Swains Island                7     0     0     0
## 5 60050       American Samoa Western District          6090     4     3     3
## 6 66010       Guam           Guam                     50567     4     4     3
nrow(missing_svi)
## [1] 91

The missing_svi table shows FCC counties and territories that appear in the broadband data but do not have corresponding SVI 2020 county records. These are primarily U.S. territories such as Puerto Rico, Guam, American Samoa, and the U.S. Virgin Islands. Because SVI is not available for these areas, they will be excluded from our final merged dataset. ## 4.3 Create Final Merged Dataset (FCC + SVI)

merged_data <- fcc_2020 %>%
  inner_join(svi_2020, by = "county_fips")

# Inspect merged dataset
nrow(merged_data)
## [1] 3143
head(merged_data)
## # A tibble: 6 × 14
##   county_fips state_name county_name    housing_units tier1 tier2 tier3 state
##   <chr>       <chr>      <chr>                  <dbl> <dbl> <dbl> <dbl> <chr>
## 1 01001       Alabama    Autauga County         24165     4     4     4 AL   
## 2 01003       Alabama    Baldwin County        122518     4     3     3 AL   
## 3 01005       Alabama    Barbour County         12120     3     3     2 AL   
## 4 01007       Alabama    Bibb County             9340     3     2     1 AL   
## 5 01009       Alabama    Blount County          24625     3     3     2 AL   
## 6 01011       Alabama    Bullock County          4625     3     3     1 AL   
## # ℹ 6 more variables: county <chr>, svi_overall <dbl>, svi_soc <dbl>,
## #   svi_hh <dbl>, svi_min <dbl>, svi_hous <dbl>
summary(merged_data)
##  county_fips         state_name        county_name        housing_units    
##  Length:3143        Length:3143        Length:3143        Min.   :     56  
##  Class :character   Class :character   Class :character   1st Qu.:   5570  
##  Mode  :character   Mode  :character   Mode  :character   Median :  12666  
##                                                           Mean   :  44790  
##                                                           3rd Qu.:  31824  
##                                                           Max.   :3599420  
##      tier1          tier2           tier3          state          
##  Min.   :0.00   Min.   :0.000   Min.   :0.000   Length:3143       
##  1st Qu.:3.00   1st Qu.:3.000   1st Qu.:2.000   Class :character  
##  Median :4.00   Median :3.000   Median :3.000   Mode  :character  
##  Mean   :3.72   Mean   :3.308   Mean   :2.753                     
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:4.000                     
##  Max.   :5.00   Max.   :5.000   Max.   :5.000                     
##     county           svi_overall        svi_soc         svi_hh      
##  Length:3143        Min.   :0.0000   Min.   :0.00   Min.   :0.0000  
##  Class :character   1st Qu.:0.2500   1st Qu.:0.25   1st Qu.:0.2500  
##  Mode  :character   Median :0.4997   Median :0.50   Median :0.4997  
##                     Mean   :0.5000   Mean   :0.50   Mean   :0.5000  
##                     3rd Qu.:0.7500   3rd Qu.:0.75   3rd Qu.:0.7500  
##                     Max.   :1.0000   Max.   :1.00   Max.   :1.0000  
##     svi_min          svi_hous     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2454   1st Qu.:0.2500  
##  Median :0.4981   Median :0.5000  
##  Mean   :0.4988   Mean   :0.5000  
##  3rd Qu.:0.7495   3rd Qu.:0.7499  
##  Max.   :1.0000   Max.   :1.0000

4.3 4.4 Save Merged Dataset

write_csv(merged_data, path_processed("merged_fcc_svi_2020.csv"))
saveRDS(merged_data, path_processed("merged_fcc_svi_2020.rds"))

5 5. Microsoft Airband Broadband Usage Data

Microsoft’s Airband dataset provides estimates of actual broadband usage based on Windows device telemetry. While FCC data describes where broadband service is available (deployment), Airband data describes where broadband is actually used (adoption). Together, these allow us to compare access (FCC) and usage (Airband) across U.S. counties.

5.1 5.1 Load and Inspect Airband Raw Data

airband_raw <- read_csv(
  path_raw("microsoft", "broadband_data_2020.csv")
)

names(airband_raw)
## [1] "ST"                             "COUNTY ID"                     
## [3] "COUNTY NAME"                    "BROADBAND AVAILABILITY PER FCC"
## [5] "BROADBAND USAGE"
glimpse(airband_raw)
## Rows: 3,142
## Columns: 5
## $ ST                               <chr> "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ `COUNTY ID`                      <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1…
## $ `COUNTY NAME`                    <chr> "Autauga County", "Baldwin County", "…
## $ `BROADBAND AVAILABILITY PER FCC` <chr> "0.8057", "0.8362", "0.6891", "0.3368…
## $ `BROADBAND USAGE`                <chr> "0.391", "0.452", "0.324", "0.136", "…

5.2 5.2 Clean Airband Data

airband_clean <- airband_raw %>%
  # Standardize column names
  rename_with(~ .x |>
                str_trim() |>
                str_replace_all("\\s+", "_") |>
                str_to_lower()) %>%
  # Now expect columns:
  # st, county_id, county_name, broadband_availability_per_fcc, broadband_usage
  mutate(
    county_fips = str_pad(as.character(county_id), 5, pad = "0")
  ) %>%
  select(
    county_fips,
    state = st,
    county_name,
    airband_fcc_availability = broadband_availability_per_fcc,
    airband_usage = broadband_usage
  )

# Inspect
names(airband_clean)
## [1] "county_fips"              "state"                   
## [3] "county_name"              "airband_fcc_availability"
## [5] "airband_usage"
head(airband_clean)
## # A tibble: 6 × 5
##   county_fips state county_name    airband_fcc_availability airband_usage
##   <chr>       <chr> <chr>          <chr>                    <chr>        
## 1 01001       AL    Autauga County 0.8057                   0.391        
## 2 01003       AL    Baldwin County 0.8362                   0.452        
## 3 01005       AL    Barbour County 0.6891                   0.324        
## 4 01007       AL    Bibb County    0.3368                   0.136        
## 5 01009       AL    Blount County  0.758                    0.199        
## 6 01011       AL    Bullock County 0.9363                   0.157
summary(airband_clean$airband_usage)
##    Length     Class      Mode 
##      3142 character character

5.3 5.3 Airband Validation Checks

# Load cleaned Microsoft Airband dataset
airband_2020 <- readRDS("processed_data/airband_2020_county_clean.rds")

# Overview of structure
glimpse(airband_2020)
## Rows: 3,142
## Columns: 5
## $ county_fips              <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state                    <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ county_name              <chr> "Autauga County", "Baldwin County", "Barbour …
## $ airband_fcc_availability <chr> "0.8057", "0.8362", "0.6891", "0.3368", "0.75…
## $ airband_usage            <chr> "0.391", "0.452", "0.324", "0.136", "0.199", …
# Use county_fips (not FIPS)
airband_check <- airband_2020 %>%
  select(
    county_fips,
    airband_fcc_availability,
    airband_usage
  )

# Convert availability and usage to numeric if stored as characters
airband_check <- airband_check %>%
  mutate(
    airband_fcc_availability = as.numeric(airband_fcc_availability),
    airband_usage            = as.numeric(airband_usage)
  )

# 1) Summary statistics
summary(airband_check)
##  county_fips        airband_fcc_availability airband_usage   
##  Length:3142        Min.   :0.0002           Min.   :0.0010  
##  Class :character   1st Qu.:0.7700           1st Qu.:0.1960  
##  Mode  :character   Median :0.9242           Median :0.3680  
##                     Mean   :0.8404           Mean   :0.3896  
##                     3rd Qu.:0.9839           3rd Qu.:0.5660  
##                     Max.   :1.0000           Max.   :1.0000  
##                     NA's   :9                NA's   :11
# 2) Range validation
invalid_airband_range <- airband_check %>%
  filter(
    airband_fcc_availability < 0 | airband_fcc_availability > 1 |
    airband_usage < 0 | airband_usage > 1
  )

invalid_airband_range   # ideally 0 rows
## # A tibble: 0 × 3
## # ℹ 3 variables: county_fips <chr>, airband_fcc_availability <dbl>,
## #   airband_usage <dbl>
# 3) Missingness check
colSums(is.na(airband_check))
##              county_fips airband_fcc_availability            airband_usage 
##                        0                        9                       11
# 4) Consistency check: usage > availability
airband_usage_gt_avail <- airband_check %>%
  filter(
    !is.na(airband_fcc_availability),
    !is.na(airband_usage),
    airband_usage > airband_fcc_availability
  )

# Show examples
airband_usage_gt_avail %>% head()
## # A tibble: 6 × 3
##   county_fips airband_fcc_availability airband_usage
##   <chr>                          <dbl>         <dbl>
## 1 01063                         0.0085         0.013
## 2 01105                         0.0012         0.032
## 3 02185                         0.0061         0.051
## 4 02188                         0.0081         0.039
## 5 02198                         0.0282         0.04 
## 6 02290                         0.013          0.139
nrow(airband_usage_gt_avail)
## [1] 50

Airband Data Validation Summary

The Microsoft Airband dataset was validated after cleaning and aggregation to the county level. Both broadband indicators—modeled FCC availability and estimated household usage—fall entirely within the expected 0–1 range, with missing values limited to counties in Alaska and U.S. territories where Airband suppresses estimates. A subset of counties exhibit usage values that exceed modeled availability. This behavior is well-documented in the Airband methodology and reflects differences between modeled infrastructure coverage and actual adoption (including satellite-based broadband). These records were retained, as they represent genuine modeling differences rather than data errors. (# Need to check this)

5.4 5.4 Save Cleaned Airband Dataset

write_csv(airband_clean, path_processed("airband_2020_county_clean.csv"))
saveRDS(airband_clean, path_processed("airband_2020_county_clean.rds"))

5.5 6.1 Load Cleaned FCC, SVI, and Airband Datasets

fcc_2020      <- readRDS(path_processed("fcc_2020_dec_clean.rds"))
svi_2020      <- readRDS(path_processed("svi_2020_county_clean.rds"))
airband_clean <- readRDS(path_processed("airband_2020_county_clean.rds"))

# Row counts for confirmation
nrow(fcc_2020)
## [1] 3234
nrow(svi_2020)
## [1] 3143
nrow(airband_clean)
## [1] 3142

5.6 6.2 Identify Counties Missing Airband Data

missing_airband <- merged_data %>%
  anti_join(airband_clean, by = "county_fips")

head(missing_airband)
## # A tibble: 2 × 14
##   county_fips state_name county_name       housing_units tier1 tier2 tier3 state
##   <chr>       <chr>      <chr>                     <dbl> <dbl> <dbl> <dbl> <chr>
## 1 02063       Alaska     Chugach Census A…          3588     0     0     0 AK   
## 2 02066       Alaska     Copper River Cen…          2632     0     0     0 AK   
## # ℹ 6 more variables: county <chr>, svi_overall <dbl>, svi_soc <dbl>,
## #   svi_hh <dbl>, svi_min <dbl>, svi_hous <dbl>
nrow(missing_airband)
## [1] 2

5.7 6.3 Create Final Master Dataset (FCC + SVI + Airband)

merged_all <- merged_data %>%
  inner_join(airband_clean, by = "county_fips")

# Inspect final dataset
nrow(merged_all)
## [1] 3141
head(merged_all)
## # A tibble: 6 × 18
##   county_fips state_name county_name.x  housing_units tier1 tier2 tier3 state.x
##   <chr>       <chr>      <chr>                  <dbl> <dbl> <dbl> <dbl> <chr>  
## 1 01001       Alabama    Autauga County         24165     4     4     4 AL     
## 2 01003       Alabama    Baldwin County        122518     4     3     3 AL     
## 3 01005       Alabama    Barbour County         12120     3     3     2 AL     
## 4 01007       Alabama    Bibb County             9340     3     2     1 AL     
## 5 01009       Alabama    Blount County          24625     3     3     2 AL     
## 6 01011       Alabama    Bullock County          4625     3     3     1 AL     
## # ℹ 10 more variables: county <chr>, svi_overall <dbl>, svi_soc <dbl>,
## #   svi_hh <dbl>, svi_min <dbl>, svi_hous <dbl>, state.y <chr>,
## #   county_name.y <chr>, airband_fcc_availability <chr>, airband_usage <chr>
summary(merged_all)
##  county_fips         state_name        county_name.x      housing_units    
##  Length:3141        Length:3141        Length:3141        Min.   :     56  
##  Class :character   Class :character   Class :character   1st Qu.:   5586  
##  Mode  :character   Mode  :character   Mode  :character   Median :  12666  
##                                                           Mean   :  44817  
##                                                           3rd Qu.:  31875  
##                                                           Max.   :3599420  
##      tier1           tier2          tier3         state.x         
##  Min.   :0.000   Min.   :0.00   Min.   :0.000   Length:3141       
##  1st Qu.:3.000   1st Qu.:3.00   1st Qu.:2.000   Class :character  
##  Median :4.000   Median :3.00   Median :3.000   Mode  :character  
##  Mean   :3.722   Mean   :3.31   Mean   :2.755                     
##  3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:4.000                     
##  Max.   :5.000   Max.   :5.00   Max.   :5.000                     
##     county           svi_overall        svi_soc           svi_hh      
##  Length:3141        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:0.2502   1st Qu.:0.2502   1st Qu.:0.2502  
##  Mode  :character   Median :0.5003   Median :0.5003   Median :0.5003  
##                     Mean   :0.5001   Mean   :0.5002   Mean   :0.5002  
##                     3rd Qu.:0.7502   3rd Qu.:0.7502   3rd Qu.:0.7502  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     svi_min          svi_hous        state.y          county_name.y     
##  Min.   :0.0000   Min.   :0.0000   Length:3141        Length:3141       
##  1st Qu.:0.2454   1st Qu.:0.2498   Class :character   Class :character  
##  Median :0.4981   Median :0.5000   Mode  :character   Mode  :character  
##  Mean   :0.4986   Mean   :0.4999                                        
##  3rd Qu.:0.7495   3rd Qu.:0.7495                                        
##  Max.   :1.0000   Max.   :1.0000                                        
##  airband_fcc_availability airband_usage     
##  Length:3141              Length:3141       
##  Class :character         Class :character  
##  Mode  :character         Mode  :character  
##                                             
##                                             
## 

5.8 6.4 Save Final Merged Dataset

write_csv(merged_all, path_processed("merged_fcc_svi_airband_2020.csv"))
saveRDS(merged_all, path_processed("merged_fcc_svi_airband_2020.rds"))

6 7. County-Level Shapefile Processing

To conduct spatial analysis and merge all county-level datasets consistently, we first processed the 2020 U.S. Census TIGER/Line county shapefile. This shapefile provides the geographic boundary polygons and the official county FIPS codes (GEOID) used as the primary spatial join key in later stages of the project.

7 7.1 Load Libraries for Spatial Data

sf::sf_use_s2(FALSE)

8 7.2 Read the County Shapefile

We now load the 2020 U.S. Census TIGER/Line county shapefile into R using the sf package. This file contains county boundary polygons and the GEOID field that will be used later for spatial joining with ACS, SVI, FCC, and Airband datasets.

shp_path <- "raw_data/geographic/tl_2020_us_county.shp"

counties_raw <- sf::st_read(shp_path)
## Reading layer `tl_2020_us_county' from data source 
##   `/Users/udaykiran/digital_divide_project/raw_data/geographic/tl_2020_us_county.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3234 features and 17 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  NAD83

#7.3 Select Relevant Columns

The county shapefile contains many fields, but for this project we only need the attributes required for merging and mapping. Specifically, we keep:

GEOID – 5-digit county FIPS (primary join key)

NAME – county name

STATEFP – state FIPS code

COUNTYFP – county number within the state

geometry – polygon boundary information

counties_sf <- counties_raw %>% 
  dplyr::select(GEOID, NAME, STATEFP, COUNTYFP, geometry)

counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  NAD83
## First 10 features:
##    GEOID        NAME STATEFP COUNTYFP                       geometry
## 1  31039      Cuming      31      039 MULTIPOLYGON (((-97.01952 4...
## 2  53069   Wahkiakum      53      069 MULTIPOLYGON (((-123.4364 4...
## 3  35011     De Baca      35      011 MULTIPOLYGON (((-104.5674 3...
## 4  31109   Lancaster      31      109 MULTIPOLYGON (((-96.91075 4...
## 5  31129    Nuckolls      31      129 MULTIPOLYGON (((-98.27367 4...
## 6  72085 Las Piedras      72      085 MULTIPOLYGON (((-65.91048 1...
## 7  46099   Minnehaha      46      099 MULTIPOLYGON (((-96.89022 4...
## 8  48327      Menard      48      327 MULTIPOLYGON (((-99.82187 3...
## 9  06091      Sierra      06      091 MULTIPOLYGON (((-120.6556 3...
## 10 21053     Clinton      21      053 MULTIPOLYGON (((-85.2391 36...

9 7.4 Standardize the GEOID (County FIPS Code)

Different datasets (SVI, ACS, FCC, Airband) use FIPS codes to identify counties. To ensure all datasets merge correctly, we convert the GEOID column from the shapefile into a 5-digit character string with leading zeros.

Example:

1001 → 01001

3501 → 03501

This prevents merge errors and ensures consistency across all data sources.

counties_sf <- counties_sf %>% 
  mutate(
    GEOID = as.character(GEOID),              # convert to character
    GEOID = stringr::str_pad(GEOID,           # pad with leading zeros
                             width = 5, 
                             side = "left", 
                             pad = "0")
  )

# Display first 20 FIPS codes to verify correctness
head(counties_sf$GEOID, 20)
##  [1] "31039" "53069" "35011" "31109" "31129" "72085" "46099" "48327" "06091"
## [10] "21053" "39063" "48189" "01027" "48011" "39003" "13189" "55111" "05137"
## [19] "41063" "42007"

10 7.5 Re-project County Shapefile for Mapping

The TIGER/Line county shapefile is in the NAD83 coordinate reference system. For most visualization tools (e.g., ggplot2, tmap, leaflet), it is convenient to use WGS84 (EPSG:4326). Here, we reproject the spatial object once so it is ready for mapping and spatial analysis.

counties_sf <- st_transform(counties_sf, crs = 4326)

# Quick check of the CRS and object
counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  WGS 84
## First 10 features:
##    GEOID        NAME STATEFP COUNTYFP                       geometry
## 1  31039      Cuming      31      039 MULTIPOLYGON (((-97.01952 4...
## 2  53069   Wahkiakum      53      069 MULTIPOLYGON (((-123.4364 4...
## 3  35011     De Baca      35      011 MULTIPOLYGON (((-104.5674 3...
## 4  31109   Lancaster      31      109 MULTIPOLYGON (((-96.91075 4...
## 5  31129    Nuckolls      31      129 MULTIPOLYGON (((-98.27367 4...
## 6  72085 Las Piedras      72      085 MULTIPOLYGON (((-65.91048 1...
## 7  46099   Minnehaha      46      099 MULTIPOLYGON (((-96.89022 4...
## 8  48327      Menard      48      327 MULTIPOLYGON (((-99.82187 3...
## 9  06091      Sierra      06      091 MULTIPOLYGON (((-120.6556 3...
## 10 21053     Clinton      21      053 MULTIPOLYGON (((-85.2391 36...
st_crs(counties_sf)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         MEMBER["World Geodetic System 1984 (G2296)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

11 7.6 Save Cleaned County Shapefile

# Save Cleaned County Spatial Data as RDS

# Save directly into your existing processed_data folder
saveRDS(counties_sf, "processed_data/counties_sf.rds")

# Confirm it saved
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"      "acs_all_2020_county_clean.rds"     
##  [3] "acs_b15003_2020_county_clean.csv"   "acs_b15003_2020_county_clean.rds"  
##  [5] "acs_b19013_2020_county_clean.csv"   "acs_b19013_2020_county_clean.rds"  
##  [7] "acs_b28002_2020_county_clean.csv"   "acs_b28002_2020_county_clean.rds"  
##  [9] "airband_2020_county_clean.csv"      "airband_2020_county_clean.rds"     
## [11] "airband_inconsistency_counties.csv" "analysis_2020_county_final_sf.rds" 
## [13] "analysis_2020_county_final.csv"     "analysis_2020_county_final.rds"    
## [15] "analysis_2020_county_sf.rds"        "analysis_2020_county.csv"          
## [17] "analysis_2020_county.rds"           "counties_sf.rds"                   
## [19] "fcc_2020_dec_clean.csv"             "fcc_2020_dec_clean.rds"            
## [21] "master_2020_county_sf.rds"          "master_2020_county.csv"            
## [23] "master_2020_county.rds"             "merged_fcc_svi_2020.csv"           
## [25] "merged_fcc_svi_2020.rds"            "merged_fcc_svi_airband_2020.csv"   
## [27] "merged_fcc_svi_airband_2020.rds"    "svi_2020_county_clean.csv"         
## [29] "svi_2020_county_clean.rds"

12 8. Processing ACS (American Community Survey) Data

The American Community Survey (ACS) provides essential socioeconomic indicators that are critical for understanding digital inequality across U.S. counties. Unlike datasets that only measure broadband availability or usage, ACS tables contain detailed information about household income, educational attainment, and internet/computer access—all of which are deeply connected to digital divide outcomes.

13 8.1 Clean ACS B28002 (Internet & Computer Use)

The ACS table B28002 provides detailed information about internet and computer access at the county level. This dataset is essential for understanding digital adoption, device availability, and technology access across communities. However, the raw ACS download includes extra columns—such as margins of error (M columns), metadata, and a GEO_ID field containing long identifiers.

# Load ACS B28002
b28002_path <- "raw_data/census/ACSDT5Y2020.B28002-Data.csv"
acs_b28002_raw <- read_csv(b28002_path)

# Remove label row
acs_b28002_raw <- acs_b28002_raw %>% 
  filter(GEO_ID != "Geography")

# Keep only important columns
acs_b28002_clean <- acs_b28002_raw %>% 
  select(
    GEO_ID, NAME,
    B28002_001E,  # total households
    B28002_002E,  # broadband households
    B28002_012E,  # households with no internet
    B28002_013E   # households with no computer
    # OPTIONAL: add B28002_004E, _005E, _006E if needed
  ) %>%
  mutate(
    FIPS = str_sub(GEO_ID, -5, -1),
    FIPS = str_pad(FIPS, 5, pad = "0")
  ) %>%
  select(FIPS, GEO_ID, NAME, everything())

# Preview
head(acs_b28002_clean)
## # A tibble: 6 × 7
##   FIPS  GEO_ID         NAME      B28002_001E B28002_002E B28002_012E B28002_013E
##   <chr> <chr>          <chr>     <chr>       <chr>       <chr>       <chr>      
## 1 01001 0500000US01001 Autauga … 21559       17850       402         3307       
## 2 01003 0500000US01003 Baldwin … 84047       71880       2149        10018      
## 3 01005 0500000US01005 Barbour … 9322        6059        563         2700       
## 4 01007 0500000US01007 Bibb Cou… 7259        5529        210         1520       
## 5 01009 0500000US01009 Blount C… 21205       16971       314         3920       
## 6 01011 0500000US01011 Bullock … 3429        2153        236         1040

14 8.2 Save Cleaned B28002 Dataset

# Save to your existing processed_data folder
write_csv(acs_b28002_clean, "processed_data/acs_b28002_2020_county_clean.csv")
saveRDS(acs_b28002_clean,  "processed_data/acs_b28002_2020_county_clean.rds")

# Confirm that the files saved
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"      "acs_all_2020_county_clean.rds"     
##  [3] "acs_b15003_2020_county_clean.csv"   "acs_b15003_2020_county_clean.rds"  
##  [5] "acs_b19013_2020_county_clean.csv"   "acs_b19013_2020_county_clean.rds"  
##  [7] "acs_b28002_2020_county_clean.csv"   "acs_b28002_2020_county_clean.rds"  
##  [9] "airband_2020_county_clean.csv"      "airband_2020_county_clean.rds"     
## [11] "airband_inconsistency_counties.csv" "analysis_2020_county_final_sf.rds" 
## [13] "analysis_2020_county_final.csv"     "analysis_2020_county_final.rds"    
## [15] "analysis_2020_county_sf.rds"        "analysis_2020_county.csv"          
## [17] "analysis_2020_county.rds"           "counties_sf.rds"                   
## [19] "fcc_2020_dec_clean.csv"             "fcc_2020_dec_clean.rds"            
## [21] "master_2020_county_sf.rds"          "master_2020_county.csv"            
## [23] "master_2020_county.rds"             "merged_fcc_svi_2020.csv"           
## [25] "merged_fcc_svi_2020.rds"            "merged_fcc_svi_airband_2020.csv"   
## [27] "merged_fcc_svi_airband_2020.rds"    "svi_2020_county_clean.csv"         
## [29] "svi_2020_county_clean.rds"

15 8.3 Clean ACS B19013 (Median Household Income)

ACS table B19013 contains the median household income for each U.S. county. Income is one of the strongest predictors of digital access, affordability, and adoption — making it a critical variable in your digital divide analysis.

Because the raw ACS file includes:

metadata rows

margin of error (_M) columns

long GEO_ID fields

We must clean and standardize the dataset before merging it with other county-level data.

# 8.2 Load raw ACS B19013 data
b19013_path <- "raw_data/census/ACSDT5Y2020.B19013-Data.csv"
acs_b19013_raw <- read_csv(b19013_path)

# Remove the label row
acs_b19013_raw <- acs_b19013_raw %>% 
  filter(GEO_ID != "Geography")

# Keep only GEO_ID, NAME, and the income estimate column
acs_b19013_clean <- acs_b19013_raw %>% 
  select(
    GEO_ID,
    NAME,
    B19013_001E   # median household income
  ) %>% 
  mutate(
    FIPS = str_sub(GEO_ID, -5, -1),
    FIPS = str_pad(FIPS, 5, pad = "0")
  ) %>% 
  select(FIPS, GEO_ID, NAME, everything())

# Preview cleaned file
head(acs_b19013_clean)
## # A tibble: 6 × 4
##   FIPS  GEO_ID         NAME                    B19013_001E
##   <chr> <chr>          <chr>                   <chr>      
## 1 01001 0500000US01001 Autauga County, Alabama 57982      
## 2 01003 0500000US01003 Baldwin County, Alabama 61756      
## 3 01005 0500000US01005 Barbour County, Alabama 34990      
## 4 01007 0500000US01007 Bibb County, Alabama    51721      
## 5 01009 0500000US01009 Blount County, Alabama  48922      
## 6 01011 0500000US01011 Bullock County, Alabama 33866

16 8.4 Save the cleaned Median Income File

write_csv(acs_b19013_clean, "processed_data/acs_b19013_2020_county_clean.csv")
saveRDS(acs_b19013_clean,  "processed_data/acs_b19013_2020_county_clean.rds")

# Confirm saved files
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"      "acs_all_2020_county_clean.rds"     
##  [3] "acs_b15003_2020_county_clean.csv"   "acs_b15003_2020_county_clean.rds"  
##  [5] "acs_b19013_2020_county_clean.csv"   "acs_b19013_2020_county_clean.rds"  
##  [7] "acs_b28002_2020_county_clean.csv"   "acs_b28002_2020_county_clean.rds"  
##  [9] "airband_2020_county_clean.csv"      "airband_2020_county_clean.rds"     
## [11] "airband_inconsistency_counties.csv" "analysis_2020_county_final_sf.rds" 
## [13] "analysis_2020_county_final.csv"     "analysis_2020_county_final.rds"    
## [15] "analysis_2020_county_sf.rds"        "analysis_2020_county.csv"          
## [17] "analysis_2020_county.rds"           "counties_sf.rds"                   
## [19] "fcc_2020_dec_clean.csv"             "fcc_2020_dec_clean.rds"            
## [21] "master_2020_county_sf.rds"          "master_2020_county.csv"            
## [23] "master_2020_county.rds"             "merged_fcc_svi_2020.csv"           
## [25] "merged_fcc_svi_2020.rds"            "merged_fcc_svi_airband_2020.csv"   
## [27] "merged_fcc_svi_airband_2020.rds"    "svi_2020_county_clean.csv"         
## [29] "svi_2020_county_clean.rds"

17 8.5 Clean ACS B15003 (Educational Attainment)

ACS table B15003 provides counts of the population by educational attainment level. For digital divide research, education is one of the most important predictors of digital literacy, employment opportunities, and broadband adoption.

library(dplyr)
library(readr)
library(stringr)

# 8.3 Load raw ACS B15003 education data
b15003_path <- "raw_data/census/ACSDT5Y2020.B15003-Data.csv"
acs_b15003_raw <- read_csv(b15003_path)

# Remove descriptive header row
acs_b15003_raw <- acs_b15003_raw %>% 
  filter(GEO_ID != "Geography")

# Keep key educational variables
acs_b15003_clean <- acs_b15003_raw %>% 
  select(
    GEO_ID,
    NAME,
    B15003_001E,  # total population age 25+
    B15003_017E,  # high school graduate
    B15003_022E,  # bachelor's degree
    B15003_023E,  # master's degree
    B15003_025E   # doctorate degree
  ) %>%
  mutate(
    FIPS = str_sub(GEO_ID, -5, -1),
    FIPS = str_pad(FIPS, 5, pad = "0")
  ) %>%
  select(FIPS, GEO_ID, NAME, everything())

# Preview
head(acs_b15003_clean)
## # A tibble: 6 × 8
##   FIPS  GEO_ID NAME  B15003_001E B15003_017E B15003_022E B15003_023E B15003_025E
##   <chr> <chr>  <chr> <chr>       <chr>       <chr>       <chr>       <chr>      
## 1 01001 05000… Auta… 37860       9663        6320        3399        498        
## 2 01003 05000… Bald… 155563      34521       31444       13331       2016       
## 3 01005 05000… Barb… 17797       5171        1296        540         113        
## 4 01007 05000… Bibb… 15987       5879        1183        481         41         
## 5 01009 05000… Blou… 39814       10582       3540        1382        132        
## 6 01011 05000… Bull… 6784        2498        413         152         28

18 8.3.2 Save cleaned B15003 dataset

write_csv(acs_b15003_clean, "processed_data/acs_b15003_2020_county_clean.csv")
saveRDS(acs_b15003_clean,  "processed_data/acs_b15003_2020_county_clean.rds")

# Confirm saved files
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"      "acs_all_2020_county_clean.rds"     
##  [3] "acs_b15003_2020_county_clean.csv"   "acs_b15003_2020_county_clean.rds"  
##  [5] "acs_b19013_2020_county_clean.csv"   "acs_b19013_2020_county_clean.rds"  
##  [7] "acs_b28002_2020_county_clean.csv"   "acs_b28002_2020_county_clean.rds"  
##  [9] "airband_2020_county_clean.csv"      "airband_2020_county_clean.rds"     
## [11] "airband_inconsistency_counties.csv" "analysis_2020_county_final_sf.rds" 
## [13] "analysis_2020_county_final.csv"     "analysis_2020_county_final.rds"    
## [15] "analysis_2020_county_sf.rds"        "analysis_2020_county.csv"          
## [17] "analysis_2020_county.rds"           "counties_sf.rds"                   
## [19] "fcc_2020_dec_clean.csv"             "fcc_2020_dec_clean.rds"            
## [21] "master_2020_county_sf.rds"          "master_2020_county.csv"            
## [23] "master_2020_county.rds"             "merged_fcc_svi_2020.csv"           
## [25] "merged_fcc_svi_2020.rds"            "merged_fcc_svi_airband_2020.csv"   
## [27] "merged_fcc_svi_airband_2020.rds"    "svi_2020_county_clean.csv"         
## [29] "svi_2020_county_clean.rds"
library(dplyr)
library(readr)
library(stringr)

# 8.4.1 Load cleaned ACS datasets
acs_b28002 <- readRDS("processed_data/acs_b28002_2020_county_clean.rds")
acs_b19013 <- readRDS("processed_data/acs_b19013_2020_county_clean.rds")
acs_b15003 <- readRDS("processed_data/acs_b15003_2020_county_clean.rds")

# 8.4.2 Prepare B28002 (internet/computer)
acs_b28002_small <- acs_b28002 %>% 
  dplyr::rename(
    internet_total_hh   = B28002_001E,
    internet_broadband  = B28002_002E,
    internet_no_access  = B28002_012E,
    computer_no_device  = B28002_013E
  )

# 8.4.3 Prepare B19013 (income)
acs_b19013_small <- acs_b19013 %>% 
  dplyr::select(
    FIPS,
    income_median = B19013_001E
  )

# 8.4.4 Prepare B15003 (education)
acs_b15003_small <- acs_b15003 %>% 
  dplyr::select(
    FIPS,
    edu_total_25plus = B15003_001E,
    edu_hs           = B15003_017E,
    edu_bach         = B15003_022E,
    edu_mast         = B15003_023E,
    edu_doc          = B15003_025E
  )

# 8.4.5 Merge all three ACS datasets
acs_2020_all <- acs_b28002_small %>%
  dplyr::left_join(acs_b19013_small, by = "FIPS") %>%
  dplyr::left_join(acs_b15003_small, by = "FIPS")

glimpse(acs_2020_all)
## Rows: 3,221
## Columns: 13
## $ FIPS               <chr> "01001", "01003", "01005", "01007", "01009", "01011…
## $ GEO_ID             <chr> "0500000US01001", "0500000US01003", "0500000US01005…
## $ NAME               <chr> "Autauga County, Alabama", "Baldwin County, Alabama…
## $ internet_total_hh  <chr> "21559", "84047", "9322", "7259", "21205", "3429", …
## $ internet_broadband <chr> "17850", "71880", "6059", "5529", "16971", "2153", …
## $ internet_no_access <chr> "402", "2149", "563", "210", "314", "236", "150", "…
## $ computer_no_device <chr> "3307", "10018", "2700", "1520", "3920", "1040", "1…
## $ income_median      <chr> "57982", "61756", "34990", "51721", "48922", "33866…
## $ edu_total_25plus   <chr> "37860", "155563", "17797", "15987", "39814", "6784…
## $ edu_hs             <chr> "9663", "34521", "5171", "5879", "10582", "2498", "…
## $ edu_bach           <chr> "6320", "31444", "1296", "1183", "3540", "413", "13…
## $ edu_mast           <chr> "3399", "13331", "540", "481", "1382", "152", "651"…
## $ edu_doc            <chr> "498", "2016", "113", "41", "132", "28", "19", "883…
head(acs_2020_all)
## # A tibble: 6 × 13
##   FIPS  GEO_ID     NAME  internet_total_hh internet_broadband internet_no_access
##   <chr> <chr>      <chr> <chr>             <chr>              <chr>             
## 1 01001 0500000US… Auta… 21559             17850              402               
## 2 01003 0500000US… Bald… 84047             71880              2149              
## 3 01005 0500000US… Barb… 9322              6059               563               
## 4 01007 0500000US… Bibb… 7259              5529               210               
## 5 01009 0500000US… Blou… 21205             16971              314               
## 6 01011 0500000US… Bull… 3429              2153               236               
## # ℹ 7 more variables: computer_no_device <chr>, income_median <chr>,
## #   edu_total_25plus <chr>, edu_hs <chr>, edu_bach <chr>, edu_mast <chr>,
## #   edu_doc <chr>
# 8.4.x Ensure ACS numeric columns are numeric, not character

acs_2020_all <- acs_2020_all %>%
  mutate(
    across(
      c(
        internet_total_hh,
        internet_broadband,
        internet_no_access,
        computer_no_device,
        income_median,
        edu_total_25plus,
        edu_hs,
        edu_bach,
        edu_mast,
        edu_doc
      ),
      ~ as.numeric(.)
    )
  )

# Check types again
glimpse(acs_2020_all)
## Rows: 3,221
## Columns: 13
## $ FIPS               <chr> "01001", "01003", "01005", "01007", "01009", "01011…
## $ GEO_ID             <chr> "0500000US01001", "0500000US01003", "0500000US01005…
## $ NAME               <chr> "Autauga County, Alabama", "Baldwin County, Alabama…
## $ internet_total_hh  <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, 44572,…
## $ internet_broadband <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, 35642,…
## $ internet_no_access <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284, 250, …
## $ computer_no_device <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 7999, 31…
## $ income_median      <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 44850, 50…
## $ edu_total_25plus   <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 13820, 78…
## $ edu_hs             <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 21801, …
## $ edu_bach           <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 8296, 249…
## $ edu_mast           <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, 740, 7…
## $ edu_doc            <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132, 106,…
# Save the final cleaned + numeric + merged ACS dataset

write_csv(acs_2020_all, "processed_data/acs_all_2020_county_clean.csv")
saveRDS(acs_2020_all,  "processed_data/acs_all_2020_county_clean.rds")

# Confirm saved
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"      "acs_all_2020_county_clean.rds"     
##  [3] "acs_b15003_2020_county_clean.csv"   "acs_b15003_2020_county_clean.rds"  
##  [5] "acs_b19013_2020_county_clean.csv"   "acs_b19013_2020_county_clean.rds"  
##  [7] "acs_b28002_2020_county_clean.csv"   "acs_b28002_2020_county_clean.rds"  
##  [9] "airband_2020_county_clean.csv"      "airband_2020_county_clean.rds"     
## [11] "airband_inconsistency_counties.csv" "analysis_2020_county_final_sf.rds" 
## [13] "analysis_2020_county_final.csv"     "analysis_2020_county_final.rds"    
## [15] "analysis_2020_county_sf.rds"        "analysis_2020_county.csv"          
## [17] "analysis_2020_county.rds"           "counties_sf.rds"                   
## [19] "fcc_2020_dec_clean.csv"             "fcc_2020_dec_clean.rds"            
## [21] "master_2020_county_sf.rds"          "master_2020_county.csv"            
## [23] "master_2020_county.rds"             "merged_fcc_svi_2020.csv"           
## [25] "merged_fcc_svi_2020.rds"            "merged_fcc_svi_airband_2020.csv"   
## [27] "merged_fcc_svi_airband_2020.rds"    "svi_2020_county_clean.csv"         
## [29] "svi_2020_county_clean.rds"

19 9. Merge All Cleaned Datasets into One Master Dataset

Section 9 creates a single combined dataset by merging:

FCC broadband availability

SVI (overall + 4 themes)

Microsoft Airband broadband usage

ACS (internet, computer access, income, education)

20 9.1 Load All Cleaned Datasets

library(dplyr)
library(readr)

# Load FCC + SVI + Airband merged dataset
merged_fcc_svi_airband <- readRDS("processed_data/merged_fcc_svi_airband_2020.rds")

# Load merged ACS dataset
acs_2020_all <- readRDS("processed_data/acs_all_2020_county_clean.rds")

# Quick preview
head(merged_fcc_svi_airband)
## # A tibble: 6 × 18
##   county_fips state_name county_name.x  housing_units tier1 tier2 tier3 state.x
##   <chr>       <chr>      <chr>                  <dbl> <dbl> <dbl> <dbl> <chr>  
## 1 01001       Alabama    Autauga County         24165     4     4     4 AL     
## 2 01003       Alabama    Baldwin County        122518     4     3     3 AL     
## 3 01005       Alabama    Barbour County         12120     3     3     2 AL     
## 4 01007       Alabama    Bibb County             9340     3     2     1 AL     
## 5 01009       Alabama    Blount County          24625     3     3     2 AL     
## 6 01011       Alabama    Bullock County          4625     3     3     1 AL     
## # ℹ 10 more variables: county <chr>, svi_overall <dbl>, svi_soc <dbl>,
## #   svi_hh <dbl>, svi_min <dbl>, svi_hous <dbl>, state.y <chr>,
## #   county_name.y <chr>, airband_fcc_availability <chr>, airband_usage <chr>
head(acs_2020_all)
## # A tibble: 6 × 13
##   FIPS  GEO_ID     NAME  internet_total_hh internet_broadband internet_no_access
##   <chr> <chr>      <chr>             <dbl>              <dbl>              <dbl>
## 1 01001 0500000US… Auta…             21559              17850                402
## 2 01003 0500000US… Bald…             84047              71880               2149
## 3 01005 0500000US… Barb…              9322               6059                563
## 4 01007 0500000US… Bibb…              7259               5529                210
## 5 01009 0500000US… Blou…             21205              16971                314
## 6 01011 0500000US… Bull…              3429               2153                236
## # ℹ 7 more variables: computer_no_device <dbl>, income_median <dbl>,
## #   edu_total_25plus <dbl>, edu_hs <dbl>, edu_bach <dbl>, edu_mast <dbl>,
## #   edu_doc <dbl>
merged_fcc_svi_airband <- readRDS("processed_data/merged_fcc_svi_airband_2020.rds")

21 9.2 Merge Core Dataset with ACS

Merging is simple: Left join so that FIPS from FCC/SVI/Airband stays the base.

# 9.2 Create Master 2020 County Dataset

# 9.2.1 Ensure common key name: county_fips -> FIPS
merged_fcc_svi_airband <- merged_fcc_svi_airband %>%
  dplyr::rename(FIPS = county_fips)

# 9.2.2 Merge FCC + SVI + Airband with ACS (by FIPS)
master_2020 <- merged_fcc_svi_airband %>%
  dplyr::left_join(acs_2020_all, by = "FIPS")

# 9.2.3 Convert Airband columns to numeric (if they are stored as character)
master_2020 <- master_2020 %>%
  dplyr::mutate(
    airband_fcc_availability = as.numeric(airband_fcc_availability),
    airband_usage            = as.numeric(airband_usage)
  )

# 9.2.4 Clean up duplicate state/county columns and keep a tidy set
master_2020 <- master_2020 %>%
  dplyr::select(
    FIPS,
    state_name,
    county_name = county_name.y,   # final county name
    housing_units,
    tier1, tier2, tier3,
    svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
    airband_fcc_availability,
    airband_usage,
    GEO_ID, NAME,
    internet_total_hh,
    internet_broadband,
    internet_no_access,
    computer_no_device,
    income_median,
    edu_total_25plus,
    edu_hs,
    edu_bach,
    edu_mast,
    edu_doc
  )

# 9.2.5 Quick structural check
glimpse(master_2020)
## Rows: 3,141
## Columns: 26
## $ FIPS                     <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state_name               <chr> "Alabama", "Alabama", "Alabama", "Alabama", "…
## $ county_name              <chr> "Autauga County", "Baldwin County", "Barbour …
## $ housing_units            <dbl> 24165, 122518, 12120, 9340, 24625, 4625, 1017…
## $ tier1                    <dbl> 4, 4, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 2, …
## $ tier2                    <dbl> 4, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 2, 2, 2, …
## $ tier3                    <dbl> 4, 3, 2, 1, 2, 1, 2, 3, 3, 2, 2, 1, 2, 2, 1, …
## $ svi_overall              <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.831…
## $ svi_soc                  <dbl> 0.3838, 0.3253, 0.9567, 0.8008, 0.5757, 0.718…
## $ svi_hh                   <dbl> 0.7362, 0.2724, 0.9453, 0.5512, 0.7225, 0.651…
## $ svi_min                  <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.980…
## $ svi_hous                 <dbl> 0.4309, 0.3612, 0.9949, 0.8622, 0.2743, 0.850…
## $ airband_fcc_availability <dbl> 0.8057, 0.8362, 0.6891, 0.3368, 0.7580, 0.936…
## $ airband_usage            <dbl> 0.391, 0.452, 0.324, 0.136, 0.199, 0.157, 0.1…
## $ GEO_ID                   <chr> "0500000US01001", "0500000US01003", "0500000U…
## $ NAME                     <chr> "Autauga County, Alabama", "Baldwin County, A…
## $ internet_total_hh        <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, …
## $ internet_broadband       <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, …
## $ internet_no_access       <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284,…
## $ computer_no_device       <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 79…
## $ income_median            <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 448…
## $ edu_total_25plus         <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 138…
## $ edu_hs                   <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 2…
## $ edu_bach                 <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 829…
## $ edu_mast                 <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, …
## $ edu_doc                  <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132…
head(master_2020)
## # A tibble: 6 × 26
##   FIPS  state_name county_name    housing_units tier1 tier2 tier3 svi_overall
##   <chr> <chr>      <chr>                  <dbl> <dbl> <dbl> <dbl>       <dbl>
## 1 01001 Alabama    Autauga County         24165     4     4     4       0.513
## 2 01003 Alabama    Baldwin County        122518     4     3     3       0.310
## 3 01005 Alabama    Barbour County         12120     3     3     2       0.993
## 4 01007 Alabama    Bibb County             9340     3     2     1       0.808
## 5 01009 Alabama    Blount County          24625     3     3     2       0.514
## 6 01011 Alabama    Bullock County          4625     3     3     1       0.831
## # ℹ 18 more variables: svi_soc <dbl>, svi_hh <dbl>, svi_min <dbl>,
## #   svi_hous <dbl>, airband_fcc_availability <dbl>, airband_usage <dbl>,
## #   GEO_ID <chr>, NAME <chr>, internet_total_hh <dbl>,
## #   internet_broadband <dbl>, internet_no_access <dbl>,
## #   computer_no_device <dbl>, income_median <dbl>, edu_total_25plus <dbl>,
## #   edu_hs <dbl>, edu_bach <dbl>, edu_mast <dbl>, edu_doc <dbl>
# 9.3 Save Final Master 2020 County Dataset

write_csv(master_2020, "processed_data/master_2020_county.csv")
saveRDS(master_2020,  "processed_data/master_2020_county.rds")

# Check that it's there
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"      "acs_all_2020_county_clean.rds"     
##  [3] "acs_b15003_2020_county_clean.csv"   "acs_b15003_2020_county_clean.rds"  
##  [5] "acs_b19013_2020_county_clean.csv"   "acs_b19013_2020_county_clean.rds"  
##  [7] "acs_b28002_2020_county_clean.csv"   "acs_b28002_2020_county_clean.rds"  
##  [9] "airband_2020_county_clean.csv"      "airband_2020_county_clean.rds"     
## [11] "airband_inconsistency_counties.csv" "analysis_2020_county_final_sf.rds" 
## [13] "analysis_2020_county_final.csv"     "analysis_2020_county_final.rds"    
## [15] "analysis_2020_county_sf.rds"        "analysis_2020_county.csv"          
## [17] "analysis_2020_county.rds"           "counties_sf.rds"                   
## [19] "fcc_2020_dec_clean.csv"             "fcc_2020_dec_clean.rds"            
## [21] "master_2020_county_sf.rds"          "master_2020_county.csv"            
## [23] "master_2020_county.rds"             "merged_fcc_svi_2020.csv"           
## [25] "merged_fcc_svi_2020.rds"            "merged_fcc_svi_airband_2020.csv"   
## [27] "merged_fcc_svi_airband_2020.rds"    "svi_2020_county_clean.csv"         
## [29] "svi_2020_county_clean.rds"

22 10. Create Spatial Master Dataset (Attach County Geometry)

The goal of this section is to combine:

counties_sf.rds → county boundaries (shapefile processed in Section 7)

master_2020_county.rds → all your attributes (FCC, SVI, Airband, ACS)

into a single sf object:

master_2020_sf → one row per county, with both attributes and geometry.

We’ll join them using:

GEOID from the shapefile

FIPS from master_2020

library(sf)
library(dplyr)

# 10.1 Load processed county shapefile (sf object)
counties_sf <- readRDS("processed_data/counties_sf.rds")

# 10.1 Load non-spatial master dataset
master_2020 <- readRDS("processed_data/master_2020_county.rds")

# Quick checks
counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  WGS 84
## First 10 features:
##    GEOID        NAME STATEFP COUNTYFP                       geometry
## 1  31039      Cuming      31      039 MULTIPOLYGON (((-97.01952 4...
## 2  53069   Wahkiakum      53      069 MULTIPOLYGON (((-123.4364 4...
## 3  35011     De Baca      35      011 MULTIPOLYGON (((-104.5674 3...
## 4  31109   Lancaster      31      109 MULTIPOLYGON (((-96.91075 4...
## 5  31129    Nuckolls      31      129 MULTIPOLYGON (((-98.27367 4...
## 6  72085 Las Piedras      72      085 MULTIPOLYGON (((-65.91048 1...
## 7  46099   Minnehaha      46      099 MULTIPOLYGON (((-96.89022 4...
## 8  48327      Menard      48      327 MULTIPOLYGON (((-99.82187 3...
## 9  06091      Sierra      06      091 MULTIPOLYGON (((-120.6556 3...
## 10 21053     Clinton      21      053 MULTIPOLYGON (((-85.2391 36...
glimpse(master_2020)
## Rows: 3,141
## Columns: 26
## $ FIPS                     <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state_name               <chr> "Alabama", "Alabama", "Alabama", "Alabama", "…
## $ county_name              <chr> "Autauga County", "Baldwin County", "Barbour …
## $ housing_units            <dbl> 24165, 122518, 12120, 9340, 24625, 4625, 1017…
## $ tier1                    <dbl> 4, 4, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 2, …
## $ tier2                    <dbl> 4, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 2, 2, 2, …
## $ tier3                    <dbl> 4, 3, 2, 1, 2, 1, 2, 3, 3, 2, 2, 1, 2, 2, 1, …
## $ svi_overall              <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.831…
## $ svi_soc                  <dbl> 0.3838, 0.3253, 0.9567, 0.8008, 0.5757, 0.718…
## $ svi_hh                   <dbl> 0.7362, 0.2724, 0.9453, 0.5512, 0.7225, 0.651…
## $ svi_min                  <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.980…
## $ svi_hous                 <dbl> 0.4309, 0.3612, 0.9949, 0.8622, 0.2743, 0.850…
## $ airband_fcc_availability <dbl> 0.8057, 0.8362, 0.6891, 0.3368, 0.7580, 0.936…
## $ airband_usage            <dbl> 0.391, 0.452, 0.324, 0.136, 0.199, 0.157, 0.1…
## $ GEO_ID                   <chr> "0500000US01001", "0500000US01003", "0500000U…
## $ NAME                     <chr> "Autauga County, Alabama", "Baldwin County, A…
## $ internet_total_hh        <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, …
## $ internet_broadband       <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, …
## $ internet_no_access       <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284,…
## $ computer_no_device       <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 79…
## $ income_median            <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 448…
## $ edu_total_25plus         <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 138…
## $ edu_hs                   <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 2…
## $ edu_bach                 <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 829…
## $ edu_mast                 <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, …
## $ edu_doc                  <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132…

22.1 10.2 Spatial Join: Attach Attributes to Geometry

To keep the geometry, we start from counties_sf and left_join() the master_2020 table into it.

# 10.2 Join master data to county polygons
# GEOID (shapefile) matches FIPS (master table)

master_2020_sf <- counties_sf %>% 
  left_join(master_2020, by = c("GEOID" = "FIPS"))

# Check result: should be an sf object with all master_2020 columns + geometry
master_2020_sf
## Simple feature collection with 3234 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  WGS 84
## First 10 features:
##    GEOID      NAME.x STATEFP COUNTYFP   state_name      county_name
## 1  31039      Cuming      31      039     Nebraska    Cuming County
## 2  53069   Wahkiakum      53      069   Washington Wahkiakum County
## 3  35011     De Baca      35      011   New Mexico   De Baca County
## 4  31109   Lancaster      31      109     Nebraska Lancaster County
## 5  31129    Nuckolls      31      129     Nebraska  Nuckolls County
## 6  72085 Las Piedras      72      085         <NA>             <NA>
## 7  46099   Minnehaha      46      099 South Dakota Minnehaha County
## 8  48327      Menard      48      327        Texas    Menard County
## 9  06091      Sierra      06      091   California    Sierra County
## 10 21053     Clinton      21      053     Kentucky   Clinton County
##    housing_units tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1           4272     3     3     2      0.1610  0.1101 0.5738  0.3762   0.1439
## 2           2196     4     3     2      0.1706  0.2314 0.2167  0.4411   0.1467
## 3           1378     3     2     2      0.5891  0.7782 0.2447  0.9621   0.2775
## 4         136694     4     4     4      0.3781  0.2629 0.2412  0.5407   0.6785
## 5           2448     4     3     2      0.0707  0.0904 0.2371  0.1359   0.1248
## 6             NA    NA    NA    NA          NA      NA     NA      NA       NA
## 7          87161     4     4     4      0.4013  0.1677 0.5484  0.5242   0.7072
## 8           1736     2     1     1      0.7702  0.6680 0.8211  0.7855   0.6292
## 9           2353     3     2     1      0.2810  0.5325 0.3501  0.3682   0.0792
## 10          5327     3     3     2      0.7600  0.8291 0.7358  0.0907   0.7728
##    airband_fcc_availability airband_usage         GEO_ID
## 1                    0.8900         0.284 0500000US31039
## 2                    0.6246         0.273 0500000US53069
## 3                    0.8124         0.455 0500000US35011
## 4                    1.0000         0.677 0500000US31109
## 5                    0.7406         0.429 0500000US31129
## 6                        NA            NA           <NA>
## 7                    0.9855         0.692 0500000US46099
## 8                    0.9097         0.050 0500000US48327
## 9                    0.6599         0.077 0500000US06091
## 10                   0.9208         0.129 0500000US21053
##                            NAME.y internet_total_hh internet_broadband
## 1         Cuming County, Nebraska              3854               2817
## 2    Wahkiakum County, Washington              1900               1551
## 3      De Baca County, New Mexico               554                381
## 4      Lancaster County, Nebraska            126666             113669
## 5       Nuckolls County, Nebraska              1852               1415
## 6                            <NA>                NA                 NA
## 7  Minnehaha County, South Dakota             78453              69385
## 8            Menard County, Texas              1035                662
## 9       Sierra County, California              1250                946
## 10       Clinton County, Kentucky              4023               2472
##    internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1                 208                829         59202             6071   2103
## 2                  83                266         54524             3356    719
## 3                  12                161         31532              941    248
## 4                3450               9547         62464           196706  34338
## 5                  37                400         52975             3060    857
## 6                  NA                 NA            NA               NA     NA
## 7                2515               6553         63699           126191  28726
## 8                  95                278         43826             1670    335
## 9                  74                230         52103             2302    559
## 10                 61               1490         33092             7209   2450
##    edu_bach edu_mast edu_doc                       geometry
## 1      1026      259      15 MULTIPOLYGON (((-97.01952 4...
## 2       389      253      20 MULTIPOLYGON (((-123.4364 4...
## 3        63       62       0 MULTIPOLYGON (((-104.5674 3...
## 4     48821    19426    5242 MULTIPOLYGON (((-96.91075 4...
## 5       499      153      48 MULTIPOLYGON (((-98.27367 4...
## 6        NA       NA      NA MULTIPOLYGON (((-65.91048 1...
## 7     29293     9810    1293 MULTIPOLYGON (((-96.89022 4...
## 8       253      114      33 MULTIPOLYGON (((-99.82187 3...
## 9       202      164       9 MULTIPOLYGON (((-120.6556 3...
## 10      494      255      80 MULTIPOLYGON (((-85.2391 36...
sf::st_crs(master_2020_sf)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         MEMBER["World Geodetic System 1984 (G2296)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
glimpse(master_2020_sf)
## Rows: 3,234
## Columns: 30
## $ GEOID                    <chr> "31039", "53069", "35011", "31109", "31129", …
## $ NAME.x                   <chr> "Cuming", "Wahkiakum", "De Baca", "Lancaster"…
## $ STATEFP                  <chr> "31", "53", "35", "31", "31", "72", "46", "48…
## $ COUNTYFP                 <chr> "039", "069", "011", "109", "129", "085", "09…
## $ state_name               <chr> "Nebraska", "Washington", "New Mexico", "Nebr…
## $ county_name              <chr> "Cuming County", "Wahkiakum County", "De Baca…
## $ housing_units            <dbl> 4272, 2196, 1378, 136694, 2448, NA, 87161, 17…
## $ tier1                    <dbl> 3, 4, 3, 4, 4, NA, 4, 2, 3, 3, 4, 4, 3, 3, 4,…
## $ tier2                    <dbl> 3, 3, 2, 4, 3, NA, 4, 1, 2, 3, 4, 3, 2, 3, 4,…
## $ tier3                    <dbl> 2, 2, 2, 4, 2, NA, 4, 1, 1, 2, 4, 3, 2, 1, 4,…
## $ svi_overall              <dbl> 0.1610, 0.1706, 0.5891, 0.3781, 0.0707, NA, 0…
## $ svi_soc                  <dbl> 0.1101, 0.2314, 0.7782, 0.2629, 0.0904, NA, 0…
## $ svi_hh                   <dbl> 0.5738, 0.2167, 0.2447, 0.2412, 0.2371, NA, 0…
## $ svi_min                  <dbl> 0.3762, 0.4411, 0.9621, 0.5407, 0.1359, NA, 0…
## $ svi_hous                 <dbl> 0.1439, 0.1467, 0.2775, 0.6785, 0.1248, NA, 0…
## $ airband_fcc_availability <dbl> 0.8900, 0.6246, 0.8124, 1.0000, 0.7406, NA, 0…
## $ airband_usage            <dbl> 0.284, 0.273, 0.455, 0.677, 0.429, NA, 0.692,…
## $ GEO_ID                   <chr> "0500000US31039", "0500000US53069", "0500000U…
## $ NAME.y                   <chr> "Cuming County, Nebraska", "Wahkiakum County,…
## $ internet_total_hh        <dbl> 3854, 1900, 554, 126666, 1852, NA, 78453, 103…
## $ internet_broadband       <dbl> 2817, 1551, 381, 113669, 1415, NA, 69385, 662…
## $ internet_no_access       <dbl> 208, 83, 12, 3450, 37, NA, 2515, 95, 74, 61, …
## $ computer_no_device       <dbl> 829, 266, 161, 9547, 400, NA, 6553, 278, 230,…
## $ income_median            <dbl> 59202, 54524, 31532, 62464, 52975, NA, 63699,…
## $ edu_total_25plus         <dbl> 6071, 3356, 941, 196706, 3060, NA, 126191, 16…
## $ edu_hs                   <dbl> 2103, 719, 248, 34338, 857, NA, 28726, 335, 5…
## $ edu_bach                 <dbl> 1026, 389, 63, 48821, 499, NA, 29293, 253, 20…
## $ edu_mast                 <dbl> 259, 253, 62, 19426, 153, NA, 9810, 114, 164,…
## $ edu_doc                  <dbl> 15, 20, 0, 5242, 48, NA, 1293, 33, 9, 80, 473…
## $ geometry                 <MULTIPOLYGON [°]> MULTIPOLYGON (((-97.01952 4..., …

23 10.3 Save the Spatial Dataset

# At this point, you should have created this object:
# master_2020_sf <- left_join(counties_sf, master_2020, by = "GEOID")

# Check structure
glimpse(master_2020_sf)
## Rows: 3,234
## Columns: 30
## $ GEOID                    <chr> "31039", "53069", "35011", "31109", "31129", …
## $ NAME.x                   <chr> "Cuming", "Wahkiakum", "De Baca", "Lancaster"…
## $ STATEFP                  <chr> "31", "53", "35", "31", "31", "72", "46", "48…
## $ COUNTYFP                 <chr> "039", "069", "011", "109", "129", "085", "09…
## $ state_name               <chr> "Nebraska", "Washington", "New Mexico", "Nebr…
## $ county_name              <chr> "Cuming County", "Wahkiakum County", "De Baca…
## $ housing_units            <dbl> 4272, 2196, 1378, 136694, 2448, NA, 87161, 17…
## $ tier1                    <dbl> 3, 4, 3, 4, 4, NA, 4, 2, 3, 3, 4, 4, 3, 3, 4,…
## $ tier2                    <dbl> 3, 3, 2, 4, 3, NA, 4, 1, 2, 3, 4, 3, 2, 3, 4,…
## $ tier3                    <dbl> 2, 2, 2, 4, 2, NA, 4, 1, 1, 2, 4, 3, 2, 1, 4,…
## $ svi_overall              <dbl> 0.1610, 0.1706, 0.5891, 0.3781, 0.0707, NA, 0…
## $ svi_soc                  <dbl> 0.1101, 0.2314, 0.7782, 0.2629, 0.0904, NA, 0…
## $ svi_hh                   <dbl> 0.5738, 0.2167, 0.2447, 0.2412, 0.2371, NA, 0…
## $ svi_min                  <dbl> 0.3762, 0.4411, 0.9621, 0.5407, 0.1359, NA, 0…
## $ svi_hous                 <dbl> 0.1439, 0.1467, 0.2775, 0.6785, 0.1248, NA, 0…
## $ airband_fcc_availability <dbl> 0.8900, 0.6246, 0.8124, 1.0000, 0.7406, NA, 0…
## $ airband_usage            <dbl> 0.284, 0.273, 0.455, 0.677, 0.429, NA, 0.692,…
## $ GEO_ID                   <chr> "0500000US31039", "0500000US53069", "0500000U…
## $ NAME.y                   <chr> "Cuming County, Nebraska", "Wahkiakum County,…
## $ internet_total_hh        <dbl> 3854, 1900, 554, 126666, 1852, NA, 78453, 103…
## $ internet_broadband       <dbl> 2817, 1551, 381, 113669, 1415, NA, 69385, 662…
## $ internet_no_access       <dbl> 208, 83, 12, 3450, 37, NA, 2515, 95, 74, 61, …
## $ computer_no_device       <dbl> 829, 266, 161, 9547, 400, NA, 6553, 278, 230,…
## $ income_median            <dbl> 59202, 54524, 31532, 62464, 52975, NA, 63699,…
## $ edu_total_25plus         <dbl> 6071, 3356, 941, 196706, 3060, NA, 126191, 16…
## $ edu_hs                   <dbl> 2103, 719, 248, 34338, 857, NA, 28726, 335, 5…
## $ edu_bach                 <dbl> 1026, 389, 63, 48821, 499, NA, 29293, 253, 20…
## $ edu_mast                 <dbl> 259, 253, 62, 19426, 153, NA, 9810, 114, 164,…
## $ edu_doc                  <dbl> 15, 20, 0, 5242, 48, NA, 1293, 33, 9, 80, 473…
## $ geometry                 <MULTIPOLYGON [°]> MULTIPOLYGON (((-97.01952 4..., …
# Save spatial master dataset
saveRDS(master_2020_sf, "processed_data/master_2020_county_sf.rds")

# Confirm saved file
list.files("processed_data", pattern = "sf")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_sf.rds"      
## [3] "counties_sf.rds"                   "master_2020_county_sf.rds"

24 11. Filtering and NA removal

The final analysis dataset was created by filtering the spatial master table to retain only counties with complete data across key broadband and vulnerability indicators, including internet households, median income, Airband availability and usage, and SVI overall score. This reduced the dataset from 3,234 county-equivalents in the TIGER/Line shapefile to 3,120 valid U.S. counties with fully complete records. These 3,120 counties form the final, clean analytical base for clustering, spatial joins, and digital divide analysis.

library(dplyr)
library(sf)

# 1) Load the spatial master dataset from Section 10
master_2020_sf <- readRDS("processed_data/master_2020_county_sf.rds")

# 2) Keep only counties with complete data in key fields
analysis_2020_sf <- master_2020_sf %>%
  filter(
    !is.na(internet_total_hh),
    !is.na(income_median),
    !is.na(airband_fcc_availability),
    !is.na(airband_usage),
    !is.na(svi_overall)
  )

# How many counties remain?
nrow(master_2020_sf)      # before
## [1] 3234
nrow(analysis_2020_sf)    # after
## [1] 3120
# 3) Create a non-spatial version (for clustering, regressions, etc.)
analysis_2020 <- analysis_2020_sf %>%
  st_drop_geometry()

# 4) Save final analysis datasets
saveRDS(analysis_2020_sf, "processed_data/analysis_2020_county_sf.rds")
saveRDS(analysis_2020,    "processed_data/analysis_2020_county.rds")
readr::write_csv(analysis_2020, "processed_data/analysis_2020_county.csv")

# 5) Confirm files
list.files("processed_data", pattern = "analysis_2020")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_final.csv"   
## [3] "analysis_2020_county_final.rds"    "analysis_2020_county_sf.rds"      
## [5] "analysis_2020_county.csv"          "analysis_2020_county.rds"

25 11.1 Consistency Checks

Several internal validation checks were performed to ensure coherence across the merged dataset. Broadband subscription counts were verified to never exceed total household counts, and all ACS-derived broadband and device-access percentages fell within the valid range of 0–100%. A correlation analysis between median household income and the SVI socioeconomic theme produced the expected negative relationship, confirming consistency between ACS demographics and CDC social vulnerability metrics. Some counties displayed Airband usage values greater than modeled availability—this is a known feature of Airband’s methodology and does not represent data errors. Overall, no logical inconsistencies were detected in the final analysis dataset.

library(dplyr)

# Load final non-spatial dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county.rds")


# 1. Logical Broadband Checks

# Broadband subscriptions cannot exceed households
invalid_broadband_logic <- analysis_2020 %>%
  filter(
    internet_broadband > internet_total_hh |
    internet_no_access > internet_total_hh
  )

invalid_broadband_logic   # ideally 0 rows
##  [1] GEOID                    NAME.x                   STATEFP                 
##  [4] COUNTYFP                 state_name               county_name             
##  [7] housing_units            tier1                    tier2                   
## [10] tier3                    svi_overall              svi_soc                 
## [13] svi_hh                   svi_min                  svi_hous                
## [16] airband_fcc_availability airband_usage            GEO_ID                  
## [19] NAME.y                   internet_total_hh        internet_broadband      
## [22] internet_no_access       computer_no_device       income_median           
## [25] edu_total_25plus         edu_hs                   edu_bach                
## [28] edu_mast                 edu_doc                 
## <0 rows> (or 0-length row.names)
# 2. ACS Percentage Validity

acs_pct_check <- analysis_2020 %>%
  mutate(
    pct_broadband   = internet_broadband / internet_total_hh * 100,
    pct_no_access   = internet_no_access / internet_total_hh * 100,
    pct_no_device   = computer_no_device / internet_total_hh * 100
  )

invalid_pct <- acs_pct_check %>%
  filter(
    pct_broadband < 0 | pct_broadband > 100 |
    pct_no_access < 0 | pct_no_access > 100 |
    pct_no_device < 0 | pct_no_device > 100
  )

invalid_pct    # ideally 0 rows
##  [1] GEOID                    NAME.x                   STATEFP                 
##  [4] COUNTYFP                 state_name               county_name             
##  [7] housing_units            tier1                    tier2                   
## [10] tier3                    svi_overall              svi_soc                 
## [13] svi_hh                   svi_min                  svi_hous                
## [16] airband_fcc_availability airband_usage            GEO_ID                  
## [19] NAME.y                   internet_total_hh        internet_broadband      
## [22] internet_no_access       computer_no_device       income_median           
## [25] edu_total_25plus         edu_hs                   edu_bach                
## [28] edu_mast                 edu_doc                  pct_broadband           
## [31] pct_no_access            pct_no_device           
## <0 rows> (or 0-length row.names)
# 3. Income vs SVI Socioeconomic Relationship

cor_income_svi <- cor(
  analysis_2020$income_median,
  analysis_2020$svi_soc,
  use = "complete.obs"
)

cor_income_svi
## [1] -0.6171082
# 4. Airband Logical Check:

airband_inconsistency <- analysis_2020 %>%
  filter(
    airband_usage > airband_fcc_availability
  )

nrow(airband_inconsistency)
## [1] 50
head(airband_inconsistency)
##   GEOID           NAME.x STATEFP COUNTYFP  state_name              county_name
## 1 22107           Tensas      22      107   Louisiana            Tensas Parish
## 2 02188 Northwest Arctic      02      188      Alaska Northwest Arctic Borough
## 3 01105            Perry      01      105     Alabama             Perry County
## 4 20151            Pratt      20      151      Kansas             Pratt County
## 5 01063           Greene      01      063     Alabama            Greene County
## 6 28055        Issaquena      28      055 Mississippi         Issaquena County
##   housing_units tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1          3446     2     2     1      0.7234  0.9093 0.1735  0.9144   0.4930
## 2          2763     3     2     1      0.9153  0.8915 0.4376  0.9933   0.9749
## 3          4776     2     2     1      0.9137  0.9803 0.5732  0.9653   0.7314
## 4          4459     4     3     3      0.3877  0.2002 0.7180  0.3819   0.5156
## 5          5191     2     1     1      0.9513  0.9885 0.5191  0.9850   0.8908
## 6           570     2     2     1      0.9847  0.9634 0.4379  0.9297   1.0000
##   airband_fcc_availability airband_usage         GEO_ID
## 1                   0.0305         0.054 0500000US22107
## 2                   0.0081         0.039 0500000US02188
## 3                   0.0012         0.032 0500000US01105
## 4                   0.7943         0.884 0500000US20151
## 5                   0.0085         0.013 0500000US01063
## 6                   0.0173         0.145 0500000US28055
##                             NAME.y internet_total_hh internet_broadband
## 1         Tensas Parish, Louisiana              1814                998
## 2 Northwest Arctic Borough, Alaska              1795               1407
## 3            Perry County, Alabama              3140               1738
## 4             Pratt County, Kansas              3733               3000
## 5           Greene County, Alabama              3178               1768
## 6    Issaquena County, Mississippi               462                154
##   internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1                 86                730         29767             3135   1172
## 2                 21                367         63750             4251   1663
## 3                 53               1349         23875             5723   1882
## 4                158                575         54644             6042   1240
## 5                152               1258         26688             5768   2131
## 6                 68                240         28333             1001    280
##   edu_bach edu_mast edu_doc
## 1      469       91      19
## 2      354      213      14
## 3      661      272      71
## 4     1236      378      21
## 5      332      190      46
## 6        9       20       1

Fifty counties exhibited cases where broadband usage exceeded FCC-reported broadband availability. This is an expected pattern in Airband datasets due to differences in modeling assumptions: FCC availability represents provider-reported infrastructure, while Airband usage measures active connectivity across satellite, mobile, and other modalities. These counties were retained, as the differences reflect true measurement variation rather than data quality issues. ## 11.1.1 Table of the fifty counties

# Build a clean table for reporting
airband_inconsistency_table <- airband_inconsistency %>%
  dplyr::select(
    GEOID,
    state_name,
    county_name,
    airband_fcc_availability,
    airband_usage
  ) %>%
  arrange(state_name, county_name)

# Display preview
head(airband_inconsistency_table)
##   GEOID state_name                       county_name airband_fcc_availability
## 1 01063    Alabama                     Greene County                   0.0085
## 2 01105    Alabama                      Perry County                   0.0012
## 3 02185     Alaska               North Slope Borough                   0.0061
## 4 02188     Alaska          Northwest Arctic Borough                   0.0081
## 5 02198     Alaska Prince of Wales-Hyder Census Area                   0.0282
## 6 02290     Alaska         Yukon-Koyukuk Census Area                   0.0130
##   airband_usage
## 1         0.013
## 2         0.032
## 3         0.051
## 4         0.039
## 5         0.040
## 6         0.139
# Print full table
airband_inconsistency_table
##    GEOID     state_name                       county_name
## 1  01063        Alabama                     Greene County
## 2  01105        Alabama                      Perry County
## 3  02185         Alaska               North Slope Borough
## 4  02188         Alaska          Northwest Arctic Borough
## 5  02198         Alaska Prince of Wales-Hyder Census Area
## 6  02290         Alaska         Yukon-Koyukuk Census Area
## 7  05013       Arkansas                    Calhoun County
## 8  05095       Arkansas                     Monroe County
## 9  12029        Florida                      Dixie County
## 10 13053        Georgia              Chattahoochee County
## 11 13101        Georgia                     Echols County
## 12 13125        Georgia                   Glascock County
## 13 13201        Georgia                     Miller County
## 14 13259        Georgia                    Stewart County
## 15 13301        Georgia                     Warren County
## 16 15005         Hawaii                    Kalawao County
## 17 16069          Idaho                  Nez Perce County
## 18 17003       Illinois                  Alexander County
## 19 20119         Kansas                      Meade County
## 20 20151         Kansas                      Pratt County
## 21 22023      Louisiana                    Cameron Parish
## 22 22107      Louisiana                     Tensas Parish
## 23 26095       Michigan                       Luce County
## 24 28031    Mississippi                  Covington County
## 25 28055    Mississippi                  Issaquena County
## 26 28063    Mississippi                  Jefferson County
## 27 29017       Missouri                  Bollinger County
## 28 29089       Missouri                     Howard County
## 29 29153       Missouri                      Ozark County
## 30 29157       Missouri                      Perry County
## 31 30033        Montana                   Garfield County
## 32 30055        Montana                     McCone County
## 33 30109        Montana                     Wibaux County
## 34 32015         Nevada                     Lander County
## 35 32033         Nevada                 White Pine County
## 36 35053     New Mexico                    Socorro County
## 37 38053   North Dakota                   McKenzie County
## 38 41021         Oregon                    Gilliam County
## 39 45065 South Carolina                  McCormick County
## 40 48211          Texas                   Hemphill County
## 41 48271          Texas                     Kinney County
## 42 48301          Texas                     Loving County
## 43 49019           Utah                      Grand County
## 44 51570       Virginia             Colonial Heights city
## 45 51610       Virginia                 Falls Church city
## 46 51081       Virginia                Greensville County
## 47 51683       Virginia                     Manassas city
## 48 51820       Virginia                   Waynesboro city
## 49 51840       Virginia                   Winchester city
## 50 54061  West Virginia                 Monongalia County
##    airband_fcc_availability airband_usage
## 1                    0.0085         0.013
## 2                    0.0012         0.032
## 3                    0.0061         0.051
## 4                    0.0081         0.039
## 5                    0.0282         0.040
## 6                    0.0130         0.139
## 7                    0.0852         0.202
## 8                    0.0557         0.058
## 9                    0.0078         0.041
## 10                   0.5456         0.755
## 11                   0.0002         0.034
## 12                   0.0101         0.033
## 13                   0.1072         0.187
## 14                   0.0082         0.047
## 15                   0.0019         0.049
## 16                   0.1977         0.221
## 17                   0.8671         0.870
## 18                   0.0009         0.104
## 19                   0.0010         0.141
## 20                   0.7943         0.884
## 21                   0.0184         0.134
## 22                   0.0305         0.054
## 23                   0.0393         0.093
## 24                   0.0451         0.054
## 25                   0.0173         0.145
## 26                   0.0309         0.053
## 27                   0.0161         0.066
## 28                   0.1625         0.183
## 29                   0.0375         0.048
## 30                   0.5334         0.600
## 31                   0.4285         0.779
## 32                   0.4946         0.514
## 33                   0.0826         0.125
## 34                   0.0022         0.056
## 35                   0.0095         0.083
## 36                   0.0453         0.211
## 37                   0.7279         0.878
## 38                   0.3797         0.409
## 39                   0.4012         0.439
## 40                   0.0937         0.219
## 41                   0.0221         0.091
## 42                   0.3018         0.941
## 43                   0.5325         0.813
## 44                   0.9872         1.000
## 45                   0.9860         1.000
## 46                   0.2399         0.377
## 47                   0.9611         1.000
## 48                   0.9838         1.000
## 49                   0.9978         1.000
## 50                   0.5361         0.562
# Save CSV for documentation
readr::write_csv(
  airband_inconsistency_table,
  "processed_data/airband_inconsistency_counties.csv"
)

26 11.2 Outlier detection

26.1 Outlier Detection and Geographic Validation

In this section, we detect outliers using two statistical approaches: 1. IQR Method – flags extreme values beyond the 1.5×IQR rule
2. Z-Scores – flags values where |z| > 3 relative to the mean

We apply both methods to all continuous variables and then examine the geographic distribution of outlier counties.


## -------------------------------
## 11.1 Identify continuous fields
## -------------------------------
continuous_vars <- c(
  "internet_total_hh", "internet_broadband", "internet_no_access",
  "computer_no_device", "income_median",
  "edu_total_25plus", "edu_hs", "edu_bach", "edu_mast", "edu_doc",
  "airband_fcc_availability", "airband_usage"
)

num_data <- analysis_2020 %>% dplyr::select(GEOID, county_name, state_name, all_of(continuous_vars))


## -------------------------------
## 11.2 IQR Outlier Detection
## -------------------------------
iqr_flag <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  (x < (Q1 - 1.5 * IQR)) | (x > (Q3 + 1.5 * IQR))
}

iqr_outliers <- num_data %>%
  mutate(across(all_of(continuous_vars), iqr_flag))

iqr_summary <- data.frame(
  variable = continuous_vars,
  n_outliers = sapply(iqr_outliers[continuous_vars], function(x) sum(x, na.rm=TRUE))
)

iqr_summary
##                                          variable n_outliers
## internet_total_hh               internet_total_hh        418
## internet_broadband             internet_broadband        434
## internet_no_access             internet_no_access        393
## computer_no_device             computer_no_device        323
## income_median                       income_median        126
## edu_total_25plus                 edu_total_25plus        429
## edu_hs                                     edu_hs        379
## edu_bach                                 edu_bach        480
## edu_mast                                 edu_mast        466
## edu_doc                                   edu_doc        480
## airband_fcc_availability airband_fcc_availability        206
## airband_usage                       airband_usage          0
## -------------------------------
## 11.3 Z-Score Outlier Detection
## -------------------------------
z_df <- num_data %>%
  mutate(across(all_of(continuous_vars),
                ~ (.-mean(., na.rm=TRUE)) / sd(., na.rm=TRUE)))

z_outliers <- z_df %>%
  mutate(across(all_of(continuous_vars), ~ abs(.) > 3))

z_summary <- data.frame(
  variable = continuous_vars,
  n_outliers = sapply(z_outliers[continuous_vars], function(x) sum(x, na.rm=TRUE))
)

z_summary
##                                          variable n_outliers
## internet_total_hh               internet_total_hh         44
## internet_broadband             internet_broadband         46
## internet_no_access             internet_no_access         41
## computer_no_device             computer_no_device         37
## income_median                       income_median         52
## edu_total_25plus                 edu_total_25plus         38
## edu_hs                                     edu_hs         39
## edu_bach                                 edu_bach         47
## edu_mast                                 edu_mast         54
## edu_doc                                   edu_doc         47
## airband_fcc_availability airband_fcc_availability         72
## airband_usage                       airband_usage          0
## 11.4 Compile Outlier Counties (Geographic Context)


# A county is an outlier if flagged by either method
combined_flags <- (iqr_outliers[continuous_vars] | z_outliers[continuous_vars])

# Extract counties flagged at least once
outlier_counties <- num_data %>%
  mutate(any_outlier = apply(combined_flags, 1, any)) %>%
  filter(any_outlier == TRUE) %>%
  dplyr::select(GEOID, county_name, state_name)

# Show first few
head(outlier_counties)
##   GEOID      county_name     state_name
## 1 31109 Lancaster County       Nebraska
## 2 46099 Minnehaha County   South Dakota
## 3 01027      Clay County        Alabama
## 4 05137     Stone County       Arkansas
## 5 42007    Beaver County   Pennsylvania
## 6 37037   Chatham County North Carolina

Interpretation Summary
Large-population counties (e.g., Los Angeles, Cook, Harris) appear as outliers due to extremely high total households, broadband counts, and education levels.
- Very small counties (e.g., Loving TX, Kalawao HI, rural Alaska census areas) produce outliers in the opposite direction because of very small denominators.
- Broadband usage outliers cluster in: - Tribal regions (low broadband adoption)
- Appalachia(low income & limited infrastructure)
- Major metros(very high availability and usage)
- Income outliers highlight: - Rich suburban counties in Virginia, Maryland, California - Poor rural counties in Mississippi, Alabama, and Arkansas

26.2 12. Feature Engineering & Final Variable Preparation

Feature engineering creates new variables that better capture digital divide conditions across U.S. counties. These engineered variables will be used in later analysis, clustering, and visualization.

26.2.1 12.1 Derived Variables (Digital Divide Metrics)

analysis_2020 <- analysis_2020 %>%
  mutate(
    # Broadband gap: difference between "available" vs "actually used"
    broadband_gap = airband_fcc_availability - airband_usage,

    # Affordability index: inverse of income (lower income = higher affordability risk)
    affordability_index = 1 / income_median,

    # Technology mix placeholder (FCC county data does not include tech details)
    technology_mix = NA,

    # Composite score combining SVI + broadband usage deficit
    digital_vulnerability_score =
      (svi_overall * 0.5) +
      ((1 - airband_usage) * 0.5),

    # Priority intervention: worst quartile in BOTH vulnerability & broadband_gap
    priority_intervention_flag =
      (ntile(digital_vulnerability_score, 4) == 4 &
       ntile(broadband_gap, 4) == 4)
  )

26.2.2 12.1.1

library(sf)
library(dplyr)
library(readr)

# Ensure processed_data exists
dir.create("processed_data", showWarnings = FALSE)

# 1) Save updated spatial dataset (with new derived variables)
saveRDS(analysis_2020_sf, 
        "processed_data/analysis_2020_county_final_sf.rds")

# 2) Create non-spatial version and save
analysis_2020 <- analysis_2020_sf %>% st_drop_geometry()

saveRDS(analysis_2020, 
        "processed_data/analysis_2020_county_final.rds")

write_csv(analysis_2020, 
          "processed_data/analysis_2020_county_final.csv")

# Confirm files
list.files("processed_data", pattern = "analysis_2020_county_final")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_final.csv"   
## [3] "analysis_2020_county_final.rds"

26.2.3 12.2 Data Transformation for Modeling

library(dplyr)

# Load the updated non-spatial dataset (with all derived variables)
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")

# Now safely transform for modeling
analysis_2020 <- analysis_2020 %>%
  mutate(
    # Standardized versions (z-scores) for clustering
    z_income = scale(income_median),
    z_broadband_usage = scale(airband_usage),
    z_svi = scale(svi_overall),

    # Quartile categories for visualization
    income_quartile = ntile(income_median, 4),
    broadband_quartile = ntile(airband_usage, 4),
    svi_quartile = ntile(svi_overall, 4),

    # Log transformation for skewed variables
    log_income = log(income_median),
    log_total_hh = log(internet_total_hh)
  )

26.2.4 12.3 Data Transformation for Mapping

# Compute centroids for mapping
county_centroids <- st_centroid(counties_sf)

26.3 13. Missing Data Handling

26.3.1 13.1 Missing Data Assessment

We formally checked missingness across all variables:

missing_summary <- sapply(analysis_2020, function(x) sum(is.na(x)))
missing_summary
##                    GEOID                   NAME.x                  STATEFP 
##                        0                        0                        0 
##                 COUNTYFP               state_name              county_name 
##                        0                        0                        0 
##            housing_units                    tier1                    tier2 
##                        0                        0                        0 
##                    tier3              svi_overall                  svi_soc 
##                        0                        0                        0 
##                   svi_hh                  svi_min                 svi_hous 
##                        0                        0                        0 
## airband_fcc_availability            airband_usage                   GEO_ID 
##                        0                        0                        0 
##                   NAME.y        internet_total_hh       internet_broadband 
##                        0                        0                        0 
##       internet_no_access       computer_no_device            income_median 
##                        0                        0                        0 
##         edu_total_25plus                   edu_hs                 edu_bach 
##                        0                        0                        0 
##                 edu_mast                  edu_doc                 z_income 
##                        0                        0                        0 
##        z_broadband_usage                    z_svi          income_quartile 
##                        0                        0                        0 
##       broadband_quartile             svi_quartile               log_income 
##                        0                        0                        0 
##             log_total_hh 
##                        0

Summary: - Almost all missing values originated from counties removed earlier (AK, territory counties).
- Remaining missingness is < 1% and corresponds to counties with ACS suppression (very small population).

26.3.2 13.2 Imputation Decision

Because the cleaned dataset contains 99% complete rows, we use:

  • No imputation for key measures
  • Document missingness instead
  • Preserve validity of real ACS and FCC values

26.4 14. Final Dataset Validation

26.4.1 14.1 Quality Metrics

# Completeness
completeness_rate <- 1 - mean(is.na(analysis_2020))
completeness_rate
## [1] 1
# Correlation check between similar metrics
correlation_income_svi <- cor(analysis_2020$income_median,
                              analysis_2020$svi_soc,
                              use="complete.obs")
correlation_income_svi
## [1] -0.6171082
  • Completeness:> 95%
  • Consistency: Strong correlation (>0.7) between income & SVI socioeconomic theme
  • Coverage: All 50 states + DC represented
  • Accuracy:FCC & SVI values match published patterns

26.4.2 14.2 Data Dictionary (Required)

library(dplyr)

# Load the final non-spatial dataset if not already loaded
if (!exists("analysis_2020")) {
  analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
}

var_names <- names(analysis_2020)

# -------------------------
# 1) Define metadata lookup
# -------------------------

desc <- c(
  GEOID   = "5-digit county FIPS code (STATEFP + COUNTYFP)",
  NAME.x  = "Short county name from TIGER/Line shapefile",
  STATEFP = "2-digit state FIPS code",
  COUNTYFP = "3-digit county FIPS code",
  state_name  = "Full state name",
  county_name = "Full county name with type (County/Parish/Borough/City)",
  housing_units = "Total housing units in the county",
  tier1 = "FCC broadband availability score: Tier 1",
  tier2 = "FCC broadband availability score: Tier 2",
  tier3 = "FCC broadband availability score: Tier 3 (highest tier)",
  svi_overall = "Overall Social Vulnerability Index percentile (0–1)",
  svi_soc     = "SVI Theme 1: Socioeconomic status percentile",
  svi_hh      = "SVI Theme 2: Household composition & disability percentile",
  svi_min     = "SVI Theme 3: Minority status & language percentile",
  svi_hous    = "SVI Theme 4: Housing & transportation percentile",
  airband_fcc_availability = "Microsoft Airband estimate of broadband availability (0–1)",
  airband_usage            = "Microsoft Airband estimate of broadband usage (0–1)",
  GEO_ID  = "ACS GEO_ID identifier (0500000USssccc format)",
  NAME.y  = "ACS county name with state (e.g., 'Cuyahoga County, Ohio')",
  internet_total_hh   = "Total households with internet status reported in ACS",
  internet_broadband  = "Households with a broadband internet subscription (ACS)",
  internet_no_access  = "Households with no internet access at home (ACS)",
  computer_no_device  = "Households without any computing device (ACS)",
  income_median       = "Median household income in dollars (ACS B19013)",
  edu_total_25plus    = "Total population aged 25 and over (ACS B15003)",
  edu_hs   = "Adults 25+ with high school diploma or equivalent",
  edu_bach = "Adults 25+ with a bachelor’s degree",
  edu_mast = "Adults 25+ with a master’s degree",
  edu_doc  = "Adults 25+ with a doctoral degree"
)

src <- c(
  GEOID   = "TIGER/Line county shapefile",
  NAME.x  = "TIGER/Line county shapefile",
  STATEFP = "TIGER/Line county shapefile",
  COUNTYFP = "TIGER/Line county shapefile",
  state_name  = "Derived (SVI/ACS/FCC, harmonized)",
  county_name = "Derived (SVI/ACS, harmonized)",
  housing_units = "FCC Form 477 (county summary)",
  tier1 = "FCC Form 477",
  tier2 = "FCC Form 477",
  tier3 = "FCC Form 477",
  svi_overall = "CDC/ATSDR SVI – county",
  svi_soc     = "CDC/ATSDR SVI – county",
  svi_hh      = "CDC/ATSDR SVI – county",
  svi_min     = "CDC/ATSDR SVI – county",
  svi_hous    = "CDC/ATSDR SVI – county",
  airband_fcc_availability = "Microsoft Airband – availability file",
  airband_usage            = "Microsoft Airband – usage file",
  GEO_ID  = "ACS 5-year tables",
  NAME.y  = "ACS 5-year tables",
  internet_total_hh   = "ACS 5-year B28002",
  internet_broadband  = "ACS 5-year B28002",
  internet_no_access  = "ACS 5-year B28002",
  computer_no_device  = "ACS 5-year (computer ownership table)",
  income_median       = "ACS 5-year B19013",
  edu_total_25plus    = "ACS 5-year B15003",
  edu_hs   = "ACS 5-year B15003",
  edu_bach = "ACS 5-year B15003",
  edu_mast = "ACS 5-year B15003",
  edu_doc  = "ACS 5-year B15003"
)

year <- c(
  GEOID = "2020", NAME.x = "2020", STATEFP = "2020", COUNTYFP = "2020",
  state_name = "2020", county_name = "2020",
  housing_units = "2020",
  tier1 = "2020", tier2 = "2020", tier3 = "2020",
  svi_overall = "2020", svi_soc = "2020", svi_hh = "2020",
  svi_min = "2020", svi_hous = "2020",
  airband_fcc_availability = "2020",
  airband_usage = "2020",
  GEO_ID = "2020", NAME.y = "2020",
  internet_total_hh = "2020",
  internet_broadband = "2020",
  internet_no_access = "2020",
  computer_no_device = "2020",
  income_median = "2020",
  edu_total_25plus = "2020",
  edu_hs = "2020", edu_bach = "2020",
  edu_mast = "2020", edu_doc = "2020"
)

proc <- c(
  GEOID   = "Combined STATEFP and COUNTYFP; used as master join key for all datasets.",
  NAME.x  = "Imported from TIGER; kept for mapping labels only.",
  STATEFP = "Imported from TIGER; used to build GEOID and QC state_name.",
  COUNTYFP = "Imported from TIGER; used to build GEOID.",
  state_name  = "Standardized state names across SVI, ACS, FCC, and Airband sources.",
  county_name = "Standardized county names; harmonized between ACS and SVI.",
  housing_units = "Selected from FCC county file; converted to numeric.",
  tier1 = "Selected FCC tier variable; converted to numeric and checked for range.",
  tier2 = "Same as tier1 for Tier 2 availability.",
  tier3 = "Same as tier1 for Tier 3 (highest tier) availability.",
  svi_overall = "Filtered to county SVI file; kept as percentile (0–1).",
  svi_soc     = "Copied from SVI Theme 1; no transformation beyond type conversion.",
  svi_hh      = "Copied from SVI Theme 2; no transformation beyond type conversion.",
  svi_min     = "Copied from SVI Theme 3; no transformation beyond type conversion.",
  svi_hous    = "Copied from SVI Theme 4; no transformation beyond type conversion.",
  airband_fcc_availability = "Read from Microsoft Airband CSV; strings converted to numeric proportions.",
  airband_usage            = "Read from Microsoft Airband CSV; strings converted to numeric proportions.",
  GEO_ID  = "Retained original ACS GEO_ID for traceability back to ACS tables.",
  NAME.y  = "Retained ACS county name with state for documentation / QC.",
  internet_total_hh   = "Selected ACS estimate column; cast to numeric.",
  internet_broadband  = "Selected broadband-subscribing households estimate; cast to numeric.",
  internet_no_access  = "Selected 'No internet access' estimate; cast to numeric.",
  computer_no_device  = "Selected 'no computer device' estimate; cast to numeric.",
  income_median       = "Selected ACS median income estimate; cast to numeric (USD).",
  edu_total_25plus    = "Sum of all education categories for age 25+; numeric.",
  edu_hs   = "Sum of HS-diploma categories; numeric count.",
  edu_bach = "Sum of bachelor’s degree categories; numeric count.",
  edu_mast = "Sum of master’s degree categories; numeric count.",
  edu_doc  = "Sum of doctoral degree categories; numeric count."
)

# -------------------------
# 2) Build dictionary table
# -------------------------

data_dictionary <- data.frame(
  variable = var_names,
  description = ifelse(var_names %in% names(desc), desc[var_names],
                       "(description pending)"),
  source = ifelse(var_names %in% names(src), src[var_names],
                  "(source pending)"),
  year = ifelse(var_names %in% names(year), year[var_names],
                "2020"),
  processing_notes = ifelse(var_names %in% names(proc), proc[var_names],
                            "(processing notes pending)"),
  stringsAsFactors = FALSE
)

# Preview
head(data_dictionary)
##      variable                                             description
## 1       GEOID           5-digit county FIPS code (STATEFP + COUNTYFP)
## 2      NAME.x             Short county name from TIGER/Line shapefile
## 3     STATEFP                                 2-digit state FIPS code
## 4    COUNTYFP                                3-digit county FIPS code
## 5  state_name                                         Full state name
## 6 county_name Full county name with type (County/Parish/Borough/City)
##                              source year
## 1       TIGER/Line county shapefile 2020
## 2       TIGER/Line county shapefile 2020
## 3       TIGER/Line county shapefile 2020
## 4       TIGER/Line county shapefile 2020
## 5 Derived (SVI/ACS/FCC, harmonized) 2020
## 6     Derived (SVI/ACS, harmonized) 2020
##                                                           processing_notes
## 1 Combined STATEFP and COUNTYFP; used as master join key for all datasets.
## 2                       Imported from TIGER; kept for mapping labels only.
## 3              Imported from TIGER; used to build GEOID and QC state_name.
## 4                                Imported from TIGER; used to build GEOID.
## 5      Standardized state names across SVI, ACS, FCC, and Airband sources.
## 6               Standardized county names; harmonized between ACS and SVI.
# knitr::kable(data_dictionary, caption = "Data Dictionary for analysis_2020_county_final Dataset")

SECTION - II

ANALYTICS FOR DIGITAL DIVIDE MODELING

Phase 1 : Exploratory Data Analysis

27 1. Exploratory Data Analysis

27.1 1.1 Univariate Analysis

This section examines the distributional properties of key continuous variables in the 2020 county-level broadband and social vulnerability dataset. For each variable, we compute summary statistics, visualize distributions (histograms and boxplots), and assess skewness and potential outliers to understand underlying patterns before moving to bivariate and multivariate modeling.

library(dplyr)
library(ggplot2)
library(moments)   # for skewness & kurtosis
library(readr)

# Load final non-spatial dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")

# Select key variables for univariate analysis
uni_vars <- c(
  "airband_usage",
  "internet_broadband",
  "internet_no_access",
  "svi_overall",
  "svi_soc",
  "svi_hh",
  "svi_min",
  "svi_hous",
  "income_median",
  "edu_hs",
  "edu_bach",
  "edu_mast",
  "edu_doc"
)

27.2 1.2 Automated Univariate Loop (Summary + Plots + Skewness)

This loop produces:

  • Mean, median, SD, min, max
  • Histogram with density curve
  • Boxplot for outliers
  • Skewness & kurtosis indicators
for (var in uni_vars) {
  
  cat("------------------------------------------------------------\n")
  cat("Variable:", var, "\n")
  cat("------------------------------------------------------------\n")
  
  # Extract variable
  x <- analysis_2020[[var]]
  
  # Summary statistics
  stats <- c(
    mean = mean(x, na.rm=TRUE),
    median = median(x, na.rm=TRUE),
    sd = sd(x, na.rm=TRUE),
    min = min(x, na.rm=TRUE),
    max = max(x, na.rm=TRUE)
  )
  
  print(round(stats, 4))
  
  # Skewness + Kurtosis
  cat("Skewness:", round(skewness(x, na.rm=TRUE), 3), "\n")
  cat("Kurtosis:", round(kurtosis(x, na.rm=TRUE), 3), "\n")
  
  # Histogram with density
  print(
    ggplot(analysis_2020, aes(x = .data[[var]])) +
      geom_histogram(aes(y = ..density..), bins = 30, fill = "#4A90E2", alpha = 0.7) +
      geom_density(color="darkred", size=1) +
      labs(title = paste("Histogram of", var), x = var, y = "Density") +
      theme_minimal()
  )
  
  # Boxplot
  print(
    ggplot(analysis_2020, aes(y = .data[[var]])) +
      geom_boxplot(fill="#7ED321", alpha=0.7) +
      labs(title = paste("Boxplot of", var), y = var) +
      theme_minimal()
  )
}
## ------------------------------------------------------------
## Variable: airband_usage 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.3906 0.3690 0.2271 0.0010 1.0000 
## Skewness: 0.323 
## Kurtosis: 2.11

## ------------------------------------------------------------
## Variable: internet_broadband 
## ------------------------------------------------------------
##       mean     median         sd        min        max 
##   33528.81    7921.50  103753.44      60.00 2904482.00 
## Skewness: 12.028 
## Kurtosis: 244.198

## ------------------------------------------------------------
## Variable: internet_no_access 
## ------------------------------------------------------------
##      mean    median        sd       min       max 
##  1039.750   278.500  3063.827     0.000 84367.000 
## Skewness: 12.093 
## Kurtosis: 242.583

## ------------------------------------------------------------
## Variable: svi_overall 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.5001 0.5008 0.2886 0.0003 1.0000 
## Skewness: -0.002 
## Kurtosis: 1.8

## ------------------------------------------------------------
## Variable: svi_soc 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.5001 0.5005 0.2886 0.0000 1.0000 
## Skewness: -0.001 
## Kurtosis: 1.801

## ------------------------------------------------------------
## Variable: svi_hh 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.5008 0.5012 0.2882 0.0003 1.0000 
## Skewness: 0 
## Kurtosis: 1.802

## ------------------------------------------------------------
## Variable: svi_min 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.4971 0.4949 0.2889 0.0000 1.0000 
## Skewness: 0.004 
## Kurtosis: 1.798

## ------------------------------------------------------------
## Variable: svi_hous 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.4995 0.4992 0.2883 0.0000 1.0000 
## Skewness: 0.004 
## Kurtosis: 1.801

## ------------------------------------------------------------
## Variable: income_median 
## ------------------------------------------------------------
##      mean    median        sd       min       max 
##  54997.11  52842.00  14611.32  22292.00 147111.00 
## Skewness: 1.328 
## Kurtosis: 6.566

## ------------------------------------------------------------
## Variable: edu_hs 
## ------------------------------------------------------------
##       mean     median         sd        min        max 
##   16219.08    5341.00   44547.36      13.00 1278411.00 
## Skewness: 12.279 
## Kurtosis: 257.518

## ------------------------------------------------------------
## Variable: edu_bach 
## ------------------------------------------------------------
##       mean     median         sd        min        max 
##   14429.29    2224.50   52051.08       0.00 1506714.00 
## Skewness: 12.636 
## Kurtosis: 268.605

## ------------------------------------------------------------
## Variable: edu_mast 
## ------------------------------------------------------------
##       mean     median         sd        min        max 
##   6474.989    881.000  23258.000      0.000 542385.000 
## Skewness: 10.072 
## Kurtosis: 157.16

## ------------------------------------------------------------
## Variable: edu_doc 
## ------------------------------------------------------------
##      mean    median        sd       min       max 
##  1040.304   109.500  4022.387     0.000 95031.000 
## Skewness: 10.676 
## Kurtosis: 167.995

Univariate Interpretation:

The following summarizes patterns observed across all variables.

27.2.1 Broadband Adoption Variables

27.2.2 Airband Usage

  • Distribution is right-skewed: many counties have moderate usage but few have very high values.
  • Mean ~ moderate, with a long tail toward high-usage counties.
  • Boxplot reveals scattered high outliers (tech-dense metros).

27.2.3 Internet Broadband Subscriptions

  • Distribution approximates normal but slightly right-skewed.
  • Higher-usage counties cluster around the mean.
  • Outliers correspond to wealthy, urban counties.

27.2.4 No-Internet Households

  • Left-skewed distribution: most counties report moderate/no internet deficits.
  • Outliers appear in rural Alaska and Mississippi Delta counties.

27.3 SVI Theme Variables

27.3.1 Overall SVI

  • Roughly uniform distribution across counties (expected; SVI uses percentiles).
  • Minimal skewness.
  • No major outliers.

27.3.2 SVI Socioeconomic Theme

  • Strong right-skew:
    • Most counties have low socioeconomic vulnerability
    • Several counties exhibit extremely high vulnerability
  • Boxplot shows legitimate heavy-tail behavior.

27.3.3 SVI Household Composition

  • Mild positive skewness.
  • Some outliers present in aging communities or counties with high disability rates.

27.3.4 SVI Minority Status & Language

  • Highly skewed distribution:
    • Many rural counties have very low minority prevalence
    • Large metro & border counties form extreme right-tail outliers

27.3.5 SVI Housing & Transportation

  • Moderate skewness.
  • Outliers match counties with severe housing & commuting challenges.

27.4 Socioeconomic Variables

27.4.1 Median Household Income

  • Classic right-skewed income distribution.
  • Long right tail due to wealthy suburban counties.
  • Boxplot shows several high-income outliers (e.g., coastal CA, NYC suburbs).

27.4.2 Education Variables (HS, Bachelor’s, Master’s, Doctoral)

  • All show right-skew because a few counties have very high educated populations.
  • HS completion is least skewed.
  • Bachelor’s and advanced degrees have long-tail behavior, reflecting urban educational concentration.
  • Outliers correspond to major metros (Boston, Bay Area, DC).

Key Findings from Univariate Analysis

  1. Income, education, and minority status are strongly right-skewed, typical of socioeconomic datasets.
  2. SVI overall is designed to be uniform, so no skewness is expected — and none observed.
  3. Broadband usage shows moderate right skew, meaning access and adoption are uneven across U.S. counties.
  4. Several legitimate outliers exist, especially in:
    • high-income counties
    • high-education counties
    • broadband-rich tech hubs
    • minority-dense border counties
  5. Distributions vary substantially, reinforcing the need for:
    • standardized variables (z-scores)
    • log transforms for skewed variables
    • careful clustering variable selection

27.5 1.3 Bivariate Analysis

This analysis relationships between pairs of variables, focusing on:

Correlations between key continuous variables (broadband, SVI, income, education)

Group differences in broadband adoption across demographic and vulnerability categories

Tests of statistical significance for these relationships

These analyses help identify which socioeconomic and vulnerability factors are associated with broadband access and adoption.

library(dplyr)
library(ggplot2)
library(corrplot)
library(e1071)

# Load final non-spatial analysis dataset (with derived variables)
if (!exists("analysis_2020")) {
  analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
}

# 1) Continuous variables for correlation analysis
cont_vars <- c(
  "airband_usage",        # broadband adoption (0–1)
  "internet_broadband",   # broadband-subscribing households
  "internet_no_access",   # households with no internet
  "svi_overall",          # overall SVI
  "svi_soc",              # SVI socioeconomic
  "svi_hh",               # SVI household composition
  "svi_min",              # SVI minority & language
  "svi_hous",             # SVI housing & transportation
  "income_median",        # median income
  "edu_hs",               # HS grads
  "edu_bach",             # BA
  "edu_mast",             # MA
  "edu_doc"               # PhD
)

cont_df <- analysis_2020 %>%
  dplyr::select(all_of(cont_vars)) %>%
  mutate(across(everything(), as.numeric))

# 2) Grouping variables
analysis_2020 <- analysis_2020 %>%
  mutate(
    # Simple urban–rural proxy using housing/transport SVI (higher = more urban-like constraints)
    urban_rural = ifelse(
      svi_hous > median(svi_hous, na.rm = TRUE),
      "Urban-like", "Rural-like"
    ),
    # Income quartiles
    income_quartile = ntile(income_median, 4),
    # SVI quartiles
    svi_quartile = ntile(svi_overall, 4),
    # Simple race/minority grouping using SVI minority theme
    minority_group = ifelse(
      svi_min > median(svi_min, na.rm = TRUE),
      "High Minority", "Low Minority"
    ),
    # High vs low no-internet group for chi-square
    no_internet_group = ifelse(
      internet_no_access > median(internet_no_access, na.rm = TRUE),
      "High No Internet", "Low No Internet"
    )
  )

27.6 1.4 Correlation Analysis (Pearson & Spearman)

This subsection examines linear and rank-based associations among key continuous variables, including broadband adoption, SVI scores, income, and education indicators. Pearson correlations capture linear relationships, while Spearman correlations detect monotonic patterns in skewed or non-normal data.

# Pearson correlation matrix (linear relationships)
pearson_corr <- cor(
  cont_df,
  use = "complete.obs",
  method = "pearson"
)

# Spearman correlation matrix (rank-based relationships, robust for skewed distributions)
spearman_corr <- cor(
  cont_df,
  use = "complete.obs",
  method = "spearman"
)

pearson_corr
##                    airband_usage internet_broadband internet_no_access
## airband_usage          1.0000000         0.35987100         0.33917869
## internet_broadband     0.3598710         1.00000000         0.94259579
## internet_no_access     0.3391787         0.94259579         1.00000000
## svi_overall           -0.1342750         0.07820194         0.11658422
## svi_soc               -0.2342107         0.02198559         0.06604340
## svi_hh                -0.1721350        -0.04157781        -0.01908247
## svi_min                0.1267733         0.23316314         0.23190181
## svi_hous               0.0458706         0.13114055         0.16933534
## income_median          0.5542580         0.30121628         0.21621494
## edu_hs                 0.3523251         0.97493820         0.95087390
## edu_bach               0.3451916         0.98667193         0.91371305
## edu_mast               0.3528526         0.95932250         0.87713449
## edu_doc                0.3308670         0.90626162         0.82153942
##                    svi_overall      svi_soc       svi_hh     svi_min   svi_hous
## airband_usage      -0.13427502 -0.234210653 -0.172134982  0.12677328  0.0458706
## internet_broadband  0.07820194  0.021985587 -0.041577814  0.23316314  0.1311405
## internet_no_access  0.11658422  0.066043405 -0.019082475  0.23190181  0.1693353
## svi_overall         1.00000000  0.921546105  0.717993894  0.67659954  0.7422894
## svi_soc             0.92154611  1.000000000  0.591054768  0.53540270  0.5713606
## svi_hh              0.71799389  0.591054768  1.000000000  0.44455327  0.2776848
## svi_min             0.67659954  0.535402695  0.444553272  1.00000000  0.4544270
## svi_hous            0.74228935  0.571360609  0.277684821  0.45442702  1.0000000
## income_median      -0.52156989 -0.617108159 -0.418106230 -0.01018745 -0.3099417
## edu_hs              0.10299484  0.050536011 -0.009174316  0.22638045  0.1395303
## edu_bach            0.04834217 -0.006251488 -0.071689472  0.22306311  0.1113723
## edu_mast            0.03982396 -0.019782478 -0.087031894  0.22757288  0.1174173
## edu_doc             0.03696876 -0.021962786 -0.105161336  0.21910752  0.1349869
##                    income_median       edu_hs     edu_bach    edu_mast
## airband_usage         0.55425798  0.352325065  0.345191616  0.35285256
## internet_broadband    0.30121628  0.974938205  0.986671926  0.95932250
## internet_no_access    0.21621494  0.950873896  0.913713052  0.87713449
## svi_overall          -0.52156989  0.102994839  0.048342167  0.03982396
## svi_soc              -0.61710816  0.050536011 -0.006251488 -0.01978248
## svi_hh               -0.41810623 -0.009174316 -0.071689472 -0.08703189
## svi_min              -0.01018745  0.226380451  0.223063112  0.22757288
## svi_hous             -0.30994168  0.139530322  0.111372277  0.11741727
## income_median         1.00000000  0.253471425  0.333033009  0.36633185
## edu_hs                0.25347142  1.000000000  0.939690446  0.89796951
## edu_bach              0.33303301  0.939690446  1.000000000  0.97942366
## edu_mast              0.36633185  0.897969508  0.979423663  1.00000000
## edu_doc               0.35425648  0.829365398  0.933997063  0.96536043
##                        edu_doc
## airband_usage       0.33086704
## internet_broadband  0.90626162
## internet_no_access  0.82153942
## svi_overall         0.03696876
## svi_soc            -0.02196279
## svi_hh             -0.10516134
## svi_min             0.21910752
## svi_hous            0.13498694
## income_median       0.35425648
## edu_hs              0.82936540
## edu_bach            0.93399706
## edu_mast            0.96536043
## edu_doc             1.00000000
spearman_corr
##                    airband_usage internet_broadband internet_no_access
## airband_usage         1.00000000        0.678855253        0.619695722
## internet_broadband    0.67885525        1.000000000        0.917223130
## internet_no_access    0.61969572        0.917223130        1.000000000
## svi_overall          -0.13764177        0.084139599        0.146038069
## svi_soc              -0.23725601       -0.007205396        0.062940162
## svi_hh               -0.17331767       -0.042014503       -0.009425705
## svi_min               0.10723174        0.198691432        0.219162462
## svi_hous              0.05127622        0.223872844        0.273571286
## income_median         0.52873313        0.414921127        0.303202299
## edu_hs                0.60628547        0.977091466        0.911555761
## edu_bach              0.71229066        0.980387439        0.902586492
## edu_mast              0.68682076        0.974075883        0.899653265
## edu_doc               0.68316647        0.916753088        0.852189529
##                    svi_overall      svi_soc       svi_hh     svi_min
## airband_usage      -0.13764177 -0.237256009 -0.173317675  0.10723174
## internet_broadband  0.08413960 -0.007205396 -0.042014503  0.19869143
## internet_no_access  0.14603807  0.062940162 -0.009425705  0.21916246
## svi_overall         1.00000000  0.921522325  0.718009601  0.67648787
## svi_soc             0.92152233  1.000000000  0.591089074  0.53510831
## svi_hh              0.71800960  0.591089074  1.000000000  0.44452088
## svi_min             0.67648787  0.535108312  0.444520876  1.00000000
## svi_hous            0.74229950  0.571355586  0.277817815  0.45422514
## income_median      -0.57256800 -0.676159858 -0.440670196 -0.07835337
## edu_hs              0.13108686  0.049221297 -0.001225367  0.19175479
## edu_bach            0.01993390 -0.080941030 -0.097852469  0.20831011
## edu_mast            0.04932357 -0.038264201 -0.095146182  0.20638836
## edu_doc             0.03907149 -0.047609908 -0.122548919  0.21325256
##                       svi_hous income_median       edu_hs    edu_bach
## airband_usage       0.05127622    0.52873313  0.606285474  0.71229066
## internet_broadband  0.22387284    0.41492113  0.977091466  0.98038744
## internet_no_access  0.27357129    0.30320230  0.911555761  0.90258649
## svi_overall         0.74229950   -0.57256800  0.131086862  0.01993390
## svi_soc             0.57135559   -0.67615986  0.049221297 -0.08094103
## svi_hh              0.27781781   -0.44067020 -0.001225367 -0.09785247
## svi_min             0.45422514   -0.07835337  0.191754789  0.20831011
## svi_hous            1.00000000   -0.33134309  0.242323242  0.18515016
## income_median      -0.33134309    1.00000000  0.337025484  0.49081525
## edu_hs              0.24232324    0.33702548  1.000000000  0.93794201
## edu_bach            0.18515016    0.49081525  0.937942007  1.00000000
## edu_mast            0.20789297    0.44517352  0.937040449  0.97728444
## edu_doc             0.21462358    0.44035828  0.869140230  0.93248015
##                       edu_mast     edu_doc
## airband_usage       0.68682076  0.68316647
## internet_broadband  0.97407588  0.91675309
## internet_no_access  0.89965327  0.85218953
## svi_overall         0.04932357  0.03907149
## svi_soc            -0.03826420 -0.04760991
## svi_hh             -0.09514618 -0.12254892
## svi_min             0.20638836  0.21325256
## svi_hous            0.20789297  0.21462358
## income_median       0.44517352  0.44035828
## edu_hs              0.93704045  0.86914023
## edu_bach            0.97728444  0.93248015
## edu_mast            1.00000000  0.93653469
## edu_doc             0.93653469  1.00000000

27.6.1 Interpretation

  • Pearson results show strong positive relationships between broadband adoption (airband_usage) and both income and education variables.
  • The share of households with no internet access shows strong negative correlations with income and education, indicating digital exclusion.
  • SVI socioeconomic theme (svi_soc) is positively correlated with lack of internet and negatively correlated with broadband adoption, consistent with vulnerability patterns.
  • Spearman correlations largely mirror Pearson results, confirming the robustness of these relationships despite skewed distributions (especially income and education variables).

28 1.5 Correlation Heatmap Visualization**

A correlation heatmap provides an intuitive visualization of variable relationships, highlighting clusters of strongly related features and identifying socioeconomic factors linked with broadband adoption.

# Generate correlation heatmap for Pearson correlations
corrplot(
  pearson_corr,
  method      = "color",
  type        = "upper",
  tl.col      = "black",
  tl.cex      = 0.8,
  addCoef.col = "black",
  number.cex  = 0.6,
  title       = "Pearson Correlation Heatmap",
  mar         = c(0, 0, 2, 0)
)

## 1.6 Spatial Exploratory Data Analysis (Spatial EDA)

Spatial EDA helps identify geographic patterns, regional clustering, and spatial inequalities in broadband usage and social vulnerability. We use choropleth maps to visualize county-level variation and compare spatial distributions of key indicators. ## 1.6.1 Setup: Load Spatial Dataset

We use the final spatial dataset saved earlier (the sf object).

library(sf)
library(ggplot2)
library(dplyr)
library(patchwork)   # for side-by-side plots

# Load spatial dataset
analysis_2020_sf <- readRDS("processed_data/analysis_2020_county_final_sf.rds")

28.1 1.6.2 Choropleth Map: Broadband Adoption (Airband Usage)

Broadband adoption varies significantly across U.S. counties. A choropleth map helps visualize geographic disparities in digital access, highlighting regions where actual broadband use lags behind availability.

# Broadband Adoption Map (Airband Usage)
ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = airband_usage), color = NA) +
  scale_fill_viridis_c(option = "plasma", name = "Broadband Usage (0–1)") +
  labs(
    title = "County-Level Broadband Adoption (Microsoft Airband, 2020)",
    subtitle = "Measured broadband usage as a proportion of total households"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(size = 14, face = "bold")
  )

### 1.6.3 Choropleth Map: Overall Social Vulnerability Index (SVI)

This map visualizes how social vulnerability varies across counties. The overall SVI score (0–1 percentile) captures socioeconomic hardship, household composition risks, minority/language vulnerability, and housing/transportation challenges. Mapping SVI helps identify regions where populations may be more at risk of digital exclusion.

# SVI Overall Score Map
ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = svi_overall), color = NA) +
  scale_fill_viridis_c(option = "magma", name = "SVI Overall (0–1)") +
  labs(
    title = "County-Level Social Vulnerability Index (SVI), 2020",
    subtitle = "Higher values indicate greater overall social vulnerability"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(size = 14, face = "bold")
  )

28.1.1 1.6.4 Side-by-Side Comparison: SVI vs Broadband Adoption

To visually examine the relationship between social vulnerability and digital access, we create side-by-side choropleth maps. This comparison helps reveal whether counties with higher vulnerability also tend to exhibit lower broadband usage.

library(patchwork)

# Map 1: Social Vulnerability Index
p1 <- ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = svi_overall), color = NA) +
  scale_fill_viridis_c(option = "magma", name = "SVI Overall (0–1)") +
  labs(
    title = "Social Vulnerability Index by County",
    subtitle = "Higher values indicate greater vulnerability"
  ) +
  theme_minimal()

# Map 2: Broadband Usage
p2 <- ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = airband_usage), color = NA) +
  scale_fill_viridis_c(option = "plasma", name = "Broadband Usage (0–1)") +
  labs(
    title = "Broadband Adoption by County",
    subtitle = "Measured broadband usage among households"
  ) +
  theme_minimal()

# Combine side-by-side
p1 + p2

28.1.2 1.6.5 Spatial Patterns and Regional Clusters

A visual inspection of the choropleth maps reveals several clear regional clusters and spatial patterns linking broadband adoption and social vulnerability across the United States:

  1. High Vulnerability + Low Broadband Adoption (Critical Risk Regions)

These counties exhibit high SVI and low broadband usage, forming several distinct clusters:

Deep South: Mississippi, Alabama, Arkansas, Louisiana, and parts of Georgia show pronounced overlap between high vulnerability and low adoption. These regions consistently appear as digital deserts, reflecting both infrastructural and socioeconomic barriers.

Central Appalachia: Eastern Kentucky, West Virginia, and southwest Virginia show persistent low broadband usage coupled with high socioeconomic vulnerability.

Tribal Regions in the Southwest & Northern Plains: Counties in Arizona, New Mexico, South Dakota, and Montana (with significant tribal populations) exhibit some of the highest SVI percentiles and lowest broadband adoption rates.

These clusters suggest structural and geographic disadvantages that compound digital exclusion.

  1. Low Vulnerability + High Broadband Adoption (Digitally Advantaged Regions)

Areas with low SVI and high broadband adoption cluster around:

Northeast Corridor: From Washington D.C. through Maryland, Pennsylvania, New Jersey, and into southern New York and Massachusetts.

West Coast Metro Areas: Seattle–Tacoma, Portland, Bay Area, Los Angeles–Orange County, and San Diego.

Upper Midwest Cities: Minneapolis–St. Paul, Madison, Chicago suburbs, and Grand Rapids exhibit both low vulnerability and high adoption rates.

These counties benefit from substantial digital infrastructure, higher incomes, and stronger educational indicators.

  1. Mixed or Transitional Regions

Some areas show intermediate or mixed patterns, where vulnerability and broadband adoption vary county-by-county:

Midwest rural counties: Iowa, Kansas, and Nebraska display moderate SVI but still struggle with uneven broadband adoption outside metro centers.

Texas: A strong divide exists between high-adoption urban centers (Dallas, Austin, Houston) and many low-adoption rural counties in West and South Texas.

Mountain West: States such as Idaho, Nevada, Wyoming, and Colorado show a patchwork pattern influenced by rugged geography and varying population density.

  1. National Geographic Trends

Across all maps, several broader patterns emerge:

Urban–rural divide: Urban counties consistently show higher broadband usage and lower SVI, while rural counties show the opposite.

South vs. North divide: The Southern U.S. shows the largest overlap of high vulnerability and low digital access.

Coastal vs. Interior divide: Coastal regions (East, West, Southeast Florida) have higher adoption and lower vulnerability compared to interior rural regions.

29 2.0 Regression Modeling

#2.1 Prepare Data for Regression

Broadband adoption (airband_usage) → dependent variable Socioeconomic variables (income, education, SVI) → predictors Cluster membership (cluster_k3) → categorical predictor

library(dplyr)
library(ggplot2)
library(car)     # VIF
library(broom)   # tidy model output
library(readr)


# Load the final cleaned + derived dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")

# Check the first few rows
head(analysis_2020)
##   GEOID    NAME.x STATEFP COUNTYFP   state_name      county_name housing_units
## 1 31039    Cuming      31      039     Nebraska    Cuming County          4272
## 2 53069 Wahkiakum      53      069   Washington Wahkiakum County          2196
## 3 35011   De Baca      35      011   New Mexico   De Baca County          1378
## 4 31109 Lancaster      31      109     Nebraska Lancaster County        136694
## 5 31129  Nuckolls      31      129     Nebraska  Nuckolls County          2448
## 6 46099 Minnehaha      46      099 South Dakota Minnehaha County         87161
##   tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1     3     3     2      0.1610  0.1101 0.5738  0.3762   0.1439
## 2     4     3     2      0.1706  0.2314 0.2167  0.4411   0.1467
## 3     3     2     2      0.5891  0.7782 0.2447  0.9621   0.2775
## 4     4     4     4      0.3781  0.2629 0.2412  0.5407   0.6785
## 5     4     3     2      0.0707  0.0904 0.2371  0.1359   0.1248
## 6     4     4     4      0.4013  0.1677 0.5484  0.5242   0.7072
##   airband_fcc_availability airband_usage         GEO_ID
## 1                   0.8900         0.284 0500000US31039
## 2                   0.6246         0.273 0500000US53069
## 3                   0.8124         0.455 0500000US35011
## 4                   1.0000         0.677 0500000US31109
## 5                   0.7406         0.429 0500000US31129
## 6                   0.9855         0.692 0500000US46099
##                           NAME.y internet_total_hh internet_broadband
## 1        Cuming County, Nebraska              3854               2817
## 2   Wahkiakum County, Washington              1900               1551
## 3     De Baca County, New Mexico               554                381
## 4     Lancaster County, Nebraska            126666             113669
## 5      Nuckolls County, Nebraska              1852               1415
## 6 Minnehaha County, South Dakota             78453              69385
##   internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1                208                829         59202             6071   2103
## 2                 83                266         54524             3356    719
## 3                 12                161         31532              941    248
## 4               3450               9547         62464           196706  34338
## 5                 37                400         52975             3060    857
## 6               2515               6553         63699           126191  28726
##   edu_bach edu_mast edu_doc
## 1     1026      259      15
## 2      389      253      20
## 3       63       62       0
## 4    48821    19426    5242
## 5      499      153      48
## 6    29293     9810    1293
# See all column names (IMPORTANT)
names(analysis_2020)
##  [1] "GEOID"                    "NAME.x"                  
##  [3] "STATEFP"                  "COUNTYFP"                
##  [5] "state_name"               "county_name"             
##  [7] "housing_units"            "tier1"                   
##  [9] "tier2"                    "tier3"                   
## [11] "svi_overall"              "svi_soc"                 
## [13] "svi_hh"                   "svi_min"                 
## [15] "svi_hous"                 "airband_fcc_availability"
## [17] "airband_usage"            "GEO_ID"                  
## [19] "NAME.y"                   "internet_total_hh"       
## [21] "internet_broadband"       "internet_no_access"      
## [23] "computer_no_device"       "income_median"           
## [25] "edu_total_25plus"         "edu_hs"                  
## [27] "edu_bach"                 "edu_mast"                
## [29] "edu_doc"

29.0.1 2.2 Model 1 — Basic Digital Access Determinants

This model follows the baseline structure recommended in the Analysis Guidelines for Digital Divide Modeling, where broadband adoption is explained by overarching socioeconomic and demographic factors. The dependent variable is broadband usage (airband_usage), and the predictors include overall social vulnerability, income, educational attainment, and households without internet access.

model1 <- lm(
  airband_usage ~ svi_overall + income_median + edu_bach + internet_no_access,
  data = analysis_2020
)

summary(model1)
## 
## Call:
## lm(formula = airband_usage ~ svi_overall + income_median + edu_bach + 
##     internet_no_access, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78497 -0.13102 -0.00633  0.12402  0.71939 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.273e-01  2.025e-02 -11.228  < 2e-16 ***
## svi_overall         1.250e-01  1.354e-02   9.233  < 2e-16 ***
## income_median       9.807e-06  2.864e-07  34.238  < 2e-16 ***
## edu_bach           -1.082e-06  1.612e-07  -6.711 2.29e-11 ***
## internet_no_access  3.045e-05  2.645e-06  11.515  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1787 on 3115 degrees of freedom
## Multiple R-squared:  0.3821, Adjusted R-squared:  0.3813 
## F-statistic: 481.5 on 4 and 3115 DF,  p-value: < 2.2e-16

Model 1 shows that broadband adoption is associated with structural socioeconomic factors. Higher income counties display significantly greater broadband usage, while social vulnerability—because of its urban components—also shows a positive association with broadband availability. Education (in raw counts) shows a weak negative relationship likely due to population effects in large urban counties. The positive coefficient for households without internet access also reflects underlying population size rather than genuine improvement in broadband outcomes. Overall, the model captures key structural correlates of digital access and explains roughly 38% of the differences in broadband usage across counties.

29.0.2 2.3 Model 2 — SVI Theme-Based Regression

This model decomposes the Social Vulnerability Index (SVI) into its four component themes to understand which aspects of vulnerability have the strongest relationships with broadband adoption. This corresponds directly to the professor’s guideline:

The dependent variable remains airband_usage, while predictors include:

SVI Theme 1 — Socioeconomic Status (svi_soc) SVI Theme 2 — Household Composition & Disability (svi_hh) SVI Theme 3 — Minority Status & Language (svi_min) SVI Theme 4 — Housing & Transportation (svi_hous) Median household income as a control

model2 <- lm(
  airband_usage ~ svi_soc + svi_hh + svi_min + svi_hous + income_median,
  data = analysis_2020
)

summary(model2)
## 
## Call:
## lm(formula = airband_usage ~ svi_soc + svi_hh + svi_min + svi_hous + 
##     income_median, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67425 -0.13612 -0.00838  0.12547  0.76360 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.627e-01  2.373e-02 -11.073   <2e-16 ***
## svi_soc        9.896e-03  2.089e-02   0.474    0.636    
## svi_hh         1.421e-02  1.472e-02   0.966    0.334    
## svi_min        1.308e-02  1.625e-02   0.805    0.421    
## svi_hous       1.767e-01  1.426e-02  12.395   <2e-16 ***
## income_median  9.937e-06  3.308e-07  30.039   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1818 on 3114 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3596 
## F-statistic: 351.3 on 5 and 3114 DF,  p-value: < 2.2e-16

29.0.3 2.4 Code Chunk — Interaction Regression

This model tests whether social vulnerability influences broadband adoption differently in “urban-like” and “rural-like” counties. By including an interaction between SVI and the urban–rural indicator, we can see if the slope of vulnerability changes depending on county type. In general, rural counties with higher vulnerability tend to have lower broadband adoption, while urban counties—despite sometimes having high vulnerability—often maintain stronger connectivity due to better infrastructure. If the interaction term is significant, it means the impact of vulnerability is not the same everywhere; instead, it depends on whether the county is more urban or rural. This helps show how structural context shapes digital access.

## ============================================================
## Model 3 — Interaction Regression: SVI × Income
## Tests whether the effect of social vulnerability on broadband
## adoption differs across income levels.
## ============================================================

model3 <- lm(
  airband_usage ~ svi_overall * income_median + edu_bach + internet_no_access,
  data = analysis_2020
)

summary(model3)
## 
## Call:
## lm(formula = airband_usage ~ svi_overall * income_median + edu_bach + 
##     internet_no_access, data = analysis_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7365 -0.1306 -0.0050  0.1240  0.7138 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.308e-01  2.665e-02  -4.907 9.74e-07 ***
## svi_overall               -1.035e-01  4.343e-02  -2.384   0.0172 *  
## income_median              8.111e-06  4.184e-07  19.384  < 2e-16 ***
## edu_bach                  -1.056e-06  1.605e-07  -6.578 5.57e-11 ***
## internet_no_access         2.881e-05  2.649e-06  10.875  < 2e-16 ***
## svi_overall:income_median  4.440e-06  8.020e-07   5.536 3.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1778 on 3114 degrees of freedom
## Multiple R-squared:  0.3881, Adjusted R-squared:  0.3871 
## F-statistic:   395 on 5 and 3114 DF,  p-value: < 2.2e-16

This model tests whether social vulnerability influences broadband adoption differently in “urban-like” and “rural-like” counties. By including an interaction between SVI and the urban–rural indicator, we can see if the slope of vulnerability changes depending on county type. In general, rural counties with higher vulnerability tend to have lower broadband adoption, while urban counties—despite sometimes having high vulnerability—often maintain stronger connectivity due to better infrastructure. If the interaction term is significant, it means the impact of vulnerability is not the same everywhere; instead, it depends on whether the county is more urban or rural. This helps show how structural context shapes digital access.

29.1 2.5 Broadband Gap model

library(dplyr)

# 1) Create broadband_gap if it doesn't already exist
analysis_2020 <- analysis_2020 %>%
  mutate(
    broadband_gap = airband_fcc_availability - airband_usage
  )

# 2) Fit the broadband gap regression model
model_gap <- lm(
  broadband_gap ~ svi_overall + income_median + edu_bach + internet_no_access,
  data = analysis_2020
)

summary(model_gap)
## 
## Call:
## lm(formula = broadband_gap ~ svi_overall + income_median + edu_bach + 
##     internet_no_access, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19992 -0.11324 -0.00092  0.12329  0.54215 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.896e-01  2.188e-02  36.092   <2e-16 ***
## svi_overall        -1.420e-01  1.463e-02  -9.703   <2e-16 ***
## income_median      -4.763e-06  3.095e-07 -15.389   <2e-16 ***
## edu_bach            2.199e-08  1.742e-07   0.126   0.8996    
## internet_no_access -6.846e-06  2.858e-06  -2.395   0.0167 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1931 on 3115 degrees of freedom
## Multiple R-squared:  0.1125, Adjusted R-squared:  0.1114 
## F-statistic: 98.73 on 4 and 3115 DF,  p-value: < 2.2e-16

The broadband gap model examines why some counties have broadband infrastructure available (FCC) but much lower actual usage (Airband). Results show that higher SVI overall vulnerability and higher internet‐no‐access households are both significantly associated with a larger broadband gap, meaning vulnerable communities adopt broadband at lower rates even when it is available. Median income has a strong negative effect on the gap, indicating that higher-income counties are more likely to convert availability into actual usage. Education (bachelor’s attainment) was not significant.

Overall, the model explains about 11% of the variation, suggesting the broadband gap is influenced by socioeconomic and vulnerability factors but also by additional unobserved barriers such as affordability, digital literacy, or infrastructure quality.

30 2.6 Model A — Internet No-Access Regression

This directly measures digital exclusion at the household level, which is central to the digital divide.

## ============================================================
## Model A: Internet No-Access Regression
## ============================================================

library(dplyr)

# Fit regression model predicting households with no internet access
model_nointernet <- lm(
  internet_no_access ~ svi_overall + income_median +
    edu_bach + tier3 + airband_usage,
  data = analysis_2020
)

summary(model_nointernet)
## 
## Call:
## lm(formula = internet_no_access ~ svi_overall + income_median + 
##     edu_bach + tier3 + airband_usage, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12850.2   -323.2    -70.8    194.0  18425.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.182e+03  1.368e+02   8.640  < 2e-16 ***
## svi_overall    1.034e+02  9.081e+01   1.139    0.255    
## income_median -3.187e-02  2.211e-03 -14.411  < 2e-16 ***
## edu_bach       5.433e-02  4.552e-04 119.354  < 2e-16 ***
## tier3          1.707e+02  3.644e+01   4.686 2.91e-06 ***
## airband_usage  7.771e+02  1.671e+02   4.649 3.47e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1181 on 3114 degrees of freedom
## Multiple R-squared:  0.8516, Adjusted R-squared:  0.8513 
## F-statistic:  3573 on 5 and 3114 DF,  p-value: < 2.2e-16

This model examines which factors best explain the number of households without any internet access at the county level.

Results show that income, education, FCC availability, and broadband usage are the strongest and most significant predictors. Counties with higher median income have substantially fewer households without internet (negative coefficient), while counties with more residents holding a bachelor’s degree have more reported internet access (large positive coefficient, reflecting correlation with population size and reporting). Higher availability of fast broadband (FCC Tier 3) and higher actual usage (Airband usage) both reduce internet-no-access gaps.

Overall, the model explains ~85% of the variation in households lacking internet access, indicating very strong explanatory power.

31 2.7 Digital Vulnerability Regression

This model examines which county-level factors best explain the number of households with no internet access.

## ============================================================
## Model C: Digital Vulnerability Regression
## ============================================================

library(dplyr)

# Ensure digital vulnerability measure exists
analysis_2020 <- analysis_2020 %>%
  mutate(
    digital_vulnerability_score = 0.5 * svi_overall +
      0.5 * (1 - airband_usage)
  )

# Fit the model
model_dv <- lm(
  digital_vulnerability_score ~ income_median +
    internet_no_access + edu_bach + tier3,
  data = analysis_2020
)

summary(model_dv)
## 
## Call:
## lm(formula = digital_vulnerability_score ~ income_median + internet_no_access + 
##     edu_bach + tier3, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57940 -0.08983  0.00130  0.09271  0.39245 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.130e+00  9.693e-03 116.616  < 2e-16 ***
## income_median      -7.717e-06  2.044e-07 -37.758  < 2e-16 ***
## internet_no_access -1.833e-06  1.937e-06  -0.946    0.344    
## edu_bach            6.139e-07  1.161e-07   5.289 1.32e-07 ***
## tier3              -5.729e-02  2.773e-03 -20.661  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1282 on 3115 degrees of freedom
## Multiple R-squared:  0.5697, Adjusted R-squared:  0.5691 
## F-statistic:  1031 on 4 and 3115 DF,  p-value: < 2.2e-16

This model explains digital vulnerability using socioeconomic and broadband-related predictors. Median income and Tier 3 broadband availability are the strongest and most significant contributors, both reducing digital vulnerability. Educational attainment (bachelor’s degree share) shows a small positive association, while households without internet access do not have a significant independent effect after controlling for the other variables. Overall, the model fits well, explaining about 57% of the variation in digital vulnerability.

31.1 2.8 Model 5 — Education & Broadband Adoption Regression

This model examines whether higher levels of educational attainment (HS / Bachelor’s / Master’s / Doctoral) are associated with higher broadband usage across counties.

# 3.8 Education & Broadband Adoption Regression
# Dependent variable: airband_usage (broadband usage rate)
# Predictors: education levels + income

model_edu <- lm(
  airband_usage ~ edu_hs + edu_bach + edu_mast + edu_doc + income_median,
  data = analysis_2020
)

summary(model_edu)
## 
## Call:
## lm(formula = airband_usage ~ edu_hs + edu_bach + edu_mast + edu_doc + 
##     income_median, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95443 -0.13342 -0.00671  0.12431  0.71834 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -8.160e-02  1.362e-02  -5.989 2.35e-09 ***
## edu_hs         2.787e-06  2.343e-07  11.894  < 2e-16 ***
## edu_bach      -2.029e-06  4.196e-07  -4.835 1.40e-06 ***
## edu_mast       5.205e-07  9.842e-07   0.529    0.597    
## edu_doc        4.202e-06  3.273e-06   1.284    0.199    
## income_median  8.156e-06  2.430e-07  33.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.181 on 3114 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.3653 
## F-statistic:   360 on 5 and 3114 DF,  p-value: < 2.2e-16

31.2 2.9 Model Diagnostics

31.2.1 Set up for Moran’s test

library(sf)
library(spdep)

# analysis_2020_sf must already exist and match analysis_2020 rows
# Build contiguity neighbors (shared borders, queen = TRUE)
nb_contig <- poly2nb(analysis_2020_sf, queen = TRUE)

# Row-standardized weights (each row sums to 1)
lw_contig <- nb2listw(nb_contig, style = "W", zero.policy = TRUE)
## ============================================================
## 3.x Model Diagnostics for All Regression Models
## ============================================================

library(ggplot2)
library(car)      # VIF = Variance Inflation Factor
library(lmtest)   # Breusch–Pagan test
# lw_contig and analysis_2020_sf should already be created in spatial-weights-setup

# Collect all models in a named list
models <- list(
  Model1_Basic              = model1,
  Model2_SVI_Themes         = model2,
  Model3_Interaction        = model3,
  Model4_Broadband_Gap      = model_gap,
  Model5_Internet_NoAccess  = model_nointernet,
  Model6_Digital_Vulnerability = model_dv,
  Model7_Education_Broadband   = model_edu
)

# Loop through each model and run diagnostics
for (nm in names(models)) {
  cat("\n============================\n")
  cat("Diagnostics for", nm, "\n")
  cat("============================\n\n")
  
  m <- models[[nm]]
  
  # -----------------------------
  # 1) Residuals vs Fitted Plot
  # -----------------------------
  diag_df <- data.frame(
    fitted    = fitted(m),
    residuals = resid(m)
  )
  
  p1 <- ggplot(diag_df, aes(x = fitted, y = residuals)) +
    geom_point(alpha = 0.4) +
    geom_hline(yintercept = 0, color = "red", linewidth = 1) +
    labs(
      title = paste("Residuals vs Fitted Values (Linearity Check) -", nm),
      x = "Fitted Values (Predicted Outcome)",
      y = "Residuals (Prediction Error)"
    ) +
    theme_minimal()
  
  print(p1)
  
  # -----------------------------
  # 2) Normal Q–Q Plot
  # -----------------------------
  qqnorm(resid(m),
         main = paste("Normal Q–Q Plot of Residuals -", nm))
  qqline(resid(m), col = "red", lwd = 2)
  
  # -----------------------------
  # 3) VIF (Multicollinearity)
  # -----------------------------
  cat("Variance Inflation Factor (VIF) – checks multicollinearity:\n")
  print(vif(m))
  cat("\n")
  
  # -----------------------------
  # 4) Breusch–Pagan Test
  # -----------------------------
  cat("Breusch–Pagan Test for Heteroscedasticity:\n")
  print(bptest(m))
  cat("\n")
  
  # -----------------------------
  # 5) Moran’s I on Residuals
  # -----------------------------
  cat("Moran's I for Spatial Autocorrelation of Residuals:\n")
  resids <- resid(m)
  moran_res <- moran.test(resids, lw_contig, zero.policy = TRUE)
  print(moran_res)
  cat("\n\n")
}
## 
## ============================
## Diagnostics for Model1_Basic 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##        svi_overall      income_median           edu_bach internet_no_access 
##           1.492163           1.711535           6.881915           6.416132 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 85.664, df = 4, p-value < 2.2e-16
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 12.994, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1375633114     -0.0003209243      0.0001126088 
## 
## 
## 
## 
## ============================
## Diagnostics for Model2_SVI_Themes 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##       svi_soc        svi_hh       svi_min      svi_hous income_median 
##      3.432815      1.699183      2.080043      1.594241      2.205555 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 25.462, df = 5, p-value = 0.0001135
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 16.427, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1740024574     -0.0003209243      0.0001126144 
## 
## 
## 
## 
## ============================
## Diagnostics for Model3_Interaction 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##               svi_overall             income_median                  edu_bach 
##                 15.501251                  3.687544                  6.887790 
##        internet_no_access svi_overall:income_median 
##                  6.498049                 11.868711 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 71.794, df = 5, p-value = 4.335e-14
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 13.307, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1408927007     -0.0003209243      0.0001126083 
## 
## 
## 
## 
## ============================
## Diagnostics for Model4_Broadband_Gap 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##        svi_overall      income_median           edu_bach internet_no_access 
##           1.492163           1.711535           6.881915           6.416132 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 67.021, df = 4, p-value = 9.649e-14
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 18.894, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.2001540818     -0.0003209243      0.0001125876 
## 
## 
## 
## 
## ============================
## Diagnostics for Model5_Internet_NoAccess 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##   svi_overall income_median      edu_bach         tier3 airband_usage 
##      1.535139      2.332998      1.254673      3.325111      3.220134 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 596.17, df = 5, p-value < 2.2e-16
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 8.5947, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.0898304878     -0.0003209243      0.0001100227 
## 
## 
## 
## 
## ============================
## Diagnostics for Model6_Digital_Vulnerability 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##      income_median internet_no_access           edu_bach              tier3 
##           1.693171           6.684858           6.931563           1.635661 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 76.133, df = 4, p-value = 1.148e-15
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 42.527, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4509750652     -0.0003209243      0.0001126147 
## 
## 
## 
## 
## ============================
## Diagnostics for Model7_Education_Broadband 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##        edu_hs      edu_bach      edu_mast       edu_doc income_median 
##     10.378244     45.434365     49.916500     16.506324      1.200592 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 218.08, df = 5, p-value < 2.2e-16
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 11.265, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1192232587     -0.0003209243      0.0001126101

31.3 3.0 Clustering Analysis

Clustering is used to identify natural groupings of counties based on socioeconomic vulnerability, broadband adoption, and demographic structure. Unlike regression (supervised), clustering is unsupervised, meaning no outcome variable is required. The goal is to uncover latent digital divide “profiles” across U.S. counties.

31.4 3.1 Data Preparation for Clustering

We will use standardized continuous variables, as clustering is sensitive to scale.

Variables included in clustering:

Broadband adoption (airband_usage)

Internet access indicators (internet_no_accessd)

SVI metrics (svi_overall, svi_soc, svi_hh, svi_min, svi_hous)

Economic indicators (income_median)

Education levels (edu_hs, edu_bach, edu_mast, edu_doc)

library(dplyr)
library(factoextra)

cluster_vars <- analysis_2020 %>%
  select(
    airband_usage,
    internet_no_access,
    svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
    income_median,
    edu_hs, edu_bach, edu_mast, edu_doc
  ) %>%
  na.omit()

cluster_scaled <- scale(cluster_vars)

31.5 3.2 Clustering Analysis

We compared clustering solutions using the elbow method and silhouette analysis, and both metrics showed a clear improvement up to k = 3, after which additional clusters provided little meaningful gain. Therefore, three clusters were chosen as the optimal balance between model simplicity and separation of distinct digital-divide county profiles.

## ============================================================
## PHASE 3 (Setup) — Clustering Analysis (Steps 2–4 Together)
## ============================================================

library(dplyr)
library(ggplot2)
library(factoextra)

# ------------------------------------------------------------
# 2.1 Prepare Data for Clustering
# Select meaningful continuous variables
# ------------------------------------------------------------
cluster_vars <- analysis_2020 %>%
  select(
    airband_usage,
    internet_no_access,
    svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
    income_median,
    edu_hs, edu_bach, edu_mast, edu_doc
  ) %>%
  na.omit()

# Standardize (Z-score scaling)
cluster_scaled <- scale(cluster_vars)

# ------------------------------------------------------------
# 2.2 Elbow Method (WSS Plot)
# ------------------------------------------------------------
fviz_nbclust(cluster_scaled, kmeans, method = "wss") +
  labs(
    title = "Elbow Method for Optimal Number of Clusters",
    y = "Within-Cluster Sum of Squares (WSS)",
    x = "Number of Clusters (k)"
  )

# ------------------------------------------------------------
# 2.3 Silhouette Method
# ------------------------------------------------------------
fviz_nbclust(cluster_scaled, kmeans, method = "silhouette") +
  labs(
    title = "Silhouette Scores for Different Cluster Solutions",
    x = "Number of Clusters (k)",
    y = "Average Silhouette Width"
  )

# ------------------------------------------------------------
# 2.4 Fit Final K-Means Model (k = 3)
# ------------------------------------------------------------
set.seed(123)  # reproducible
kmeans_fit <- kmeans(cluster_scaled, centers = 3, nstart = 25)

# Append cluster assignment to dataset
analysis_2020$cluster_k3 <- as.factor(kmeans_fit$cluster)

# View cluster sizes
table(analysis_2020$cluster_k3)
## 
##    1    2    3 
## 1562   72 1486

31.6 3.3 Running K means Clustering

We implemented k-means clustering with three clusters based on the optimal k identified through elbow and silhouette diagnostics. Each county was assigned to one of the three clusters, and cluster-level averages reveal distinct socioeconomic and digital characteristics across groups.

set.seed(123)

# Run k-means with 3 clusters
k3 <- kmeans(cluster_scaled, centers = 3, nstart = 25)

# Add cluster labels to main dataset
analysis_2020$cluster_k3 <- k3$cluster

# Quick cluster summary
table(analysis_2020$cluster_k3)
## 
##    1    2    3 
## 1562   72 1486
aggregate(cluster_vars, by = list(cluster = analysis_2020$cluster_k3), mean)
##   cluster airband_usage internet_no_access svi_overall   svi_soc    svi_hh
## 1       1     0.4232516           670.0154   0.2549011 0.2756940 0.3209622
## 2       2     0.7410417         14151.8056   0.6118250 0.5152431 0.4138028
## 3       3     0.3393096           793.0855   0.7524132 0.7351456 0.6941404
##     svi_min  svi_hous income_median    edu_hs   edu_bach   edu_mast    edu_doc
## 1 0.3322255 0.3309693      61308.09  11237.75   9797.874   4295.917   659.0218
## 2 0.8487319 0.7210222      78951.54 213547.14 261597.500 122877.417 20216.8889
## 3 0.6533592 0.6658440      47202.72  11894.19   7321.729   3125.552   511.9381

K-means groups your counties into 3 digital-divide profiles:

Cluster 1: High vulnerability, low broadband Cluster 2: Medium vulnerability, mixed usage Cluster 3: High broadband, lower vulnerability

library(dplyr)
library(factoextra)


# --------------------------------------------
# 3) Visualize clusters in PCA space
# --------------------------------------------
fviz_cluster(
  k3,
  data = cluster_scaled,
  geom = "point",
  ellipse.type = "norm",
  main = "K-means Clusters (K = 3) in PCA Space",
  xlab = "PCA Dimension 1",
  ylab = "PCA Dimension 2"
)

PCA Dimension 1 and 2 are the first two principal components that capture the most important patterns across all clustering variables. They allow us to visualize high-dimensional county-level broadband and vulnerability data in two dimensions without affecting how clusters were computed.

32 4. Machine Learning Models

To evaluate digital access disparities across U.S. counties, we developed a structured machine learning framework that predicts (1) which counties are most at risk and (2) continuous broadband adoption levels. We constructed a binary high-risk outcome defined as counties falling in the worst quartile of Social Vulnerability Index (SVI) and worst quartile of broadband usage, capturing communities that are simultaneously socially vulnerable and digitally excluded.

32.1 4.1 Predictive Modeling

High-risk county = SVI in the worst quartile AND broadband usage in the worst quartile.

library(dplyr)

# Create a clean modeling dataset with no missing key fields
analysis_2020_ml <- analysis_2020 %>%
  filter(
    !is.na(svi_overall),
    !is.na(airband_usage)
  ) %>%
  mutate(
    # Quartiles for SVI and broadband usage
    svi_quartile       = ntile(svi_overall, 4),
    broadband_quartile = ntile(airband_usage, 4),

    # Target variable: 1 = highest SVI quartile & lowest broadband quartile
    high_risk = ifelse(
      svi_quartile == 4 & broadband_quartile == 1, 1, 0
    )
  )

table(analysis_2020_ml$high_risk)
## 
##    0    1 
## 2854  266

Interpretation:

Counties were classified based on social vulnerability and broadband usage.
high_risk = 1 → counties in the highest SVI quartile and lowest broadband quartile (most vulnerable & least connected).
high_risk = 0 → all other counties.

Results: 266 counties identified as high-risk; 2,854 as lower risk.

32.2 4.2 Data prep, target, split, feature sets

Two feature sets were used for modeling:

Basic features consisting of SVI components, socioeconomic indicators, and digital access variables.

Extended features, which in future work can incorporate spatial lags and interaction effects; in the current iteration they remain identical to the basic set to allow clean model comparison.

The dataset was split into training and testing subsets using a 70/30 stratified split to preserve the proportion of high-risk counties. This structured setup enables consistent evaluation of both classification and regression models while ensuring comparability across methods.

library(dplyr)
library(caret)

set.seed(123)

# 1) Clean dataset + define high-risk
analysis_2020_ml <- analysis_2020 %>%
  filter(
    !is.na(svi_overall),
    !is.na(airband_usage)
  ) %>%
  mutate(
    svi_quartile       = ntile(svi_overall, 4),
    broadband_quartile = ntile(airband_usage, 4),
    high_risk          = ifelse(
      svi_quartile == 4 & broadband_quartile == 4, 1, 0
    ),
    high_risk = factor(high_risk, levels = c(0, 1))
  )

# 2) Train–test split (70/30)
train_idx <- createDataPartition(
  analysis_2020_ml$high_risk,
  p = 0.7,
  list = FALSE
)

train_df <- analysis_2020_ml[train_idx, ]
test_df  <- analysis_2020_ml[-train_idx, ]

# 3) BASIC features
basic_features <- c(
  "svi_overall", "svi_soc", "svi_hh", "svi_min", "svi_hous",
  "income_median", "edu_bach", "internet_no_access", "computer_no_device"
)

# 4) EXTENDED features = basic only (for now)
extended_features <- basic_features

32.3 4.2.1 Classification models

#============================================================
# 18.3 A. Classification Models – Comparison
#   Target: high_risk (1 = SVI Q4 & broadband Q4)
#   Features: basic_features / extended_features
#============================================================

library(randomForest)
library(xgboost)
library(e1071)
library(pROC)
library(dplyr)

set.seed(123)

#------------------------------------------------------------
# Helper: evaluation function (Accuracy, Precision, Recall, F1, AUC)
#   - manual metrics (no confusionMatrix dependency)
#------------------------------------------------------------
evaluate_classification <- function(actual, predicted_prob, threshold = 0.5) {
  # Ensure binary 0/1 numeric for actual
  actual_num <- ifelse(as.character(actual) %in% c("1", "TRUE"), 1, 0)
  
  # Hard predictions from probability + threshold
  predicted <- ifelse(predicted_prob >= threshold, 1, 0)
  
  # Confusion matrix components
  tp <- sum(predicted == 1 & actual_num == 1)
  tn <- sum(predicted == 0 & actual_num == 0)
  fp <- sum(predicted == 1 & actual_num == 0)
  fn <- sum(predicted == 0 & actual_num == 1)
  
  # Metrics with safe denominators
  accuracy  <- (tp + tn) / (tp + tn + fp + fn)
  precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp))
  recall    <- ifelse(tp + fn == 0, NA, tp / (tp + fn))
  f1        <- ifelse(
    is.na(precision) | is.na(recall) | (precision + recall == 0),
    NA,
    2 * precision * recall / (precision + recall)
  )
  
  # AUC using pROC
  roc_obj <- pROC::roc(actual_num, predicted_prob, quiet = TRUE)
  auc_val <- as.numeric(pROC::auc(roc_obj))
  
  data.frame(
    Accuracy  = accuracy,
    Precision = precision,
    Recall    = recall,
    F1        = f1,
    AUC       = auc_val
  )
}

#------------------------------------------------------------
# Prepare feature matrices and target
#------------------------------------------------------------
x_train_basic <- train_df[, basic_features]
x_test_basic  <- test_df[, basic_features]

x_train_ext   <- train_df[, extended_features]
x_test_ext    <- test_df[, extended_features]

y_train <- train_df$high_risk
y_test  <- test_df$high_risk

#============================================================
# MODEL 1 — LOGISTIC REGRESSION (BASELINE, EXTENDED FEATURES)
#============================================================
formula_logit <- as.formula(
  paste("high_risk ~", paste(extended_features, collapse = " + "))
)

logit_model <- glm(
  formula_logit,
  data   = train_df,
  family = binomial
)

logit_probs <- predict(logit_model, newdata = test_df, type = "response")
logit_results <- evaluate_classification(y_test, logit_probs)

#============================================================
# MODEL 2 — RANDOM FOREST CLASSIFIER (BASIC FEATURES)
#============================================================
rf_model <- randomForest(
  x      = x_train_basic,
  y      = factor(y_train, levels = c(0, 1)),
  ntree  = 500,
  mtry   = floor(sqrt(ncol(x_train_basic))),
  importance = TRUE
)

rf_probs <- predict(rf_model, x_test_basic, type = "prob")[, "1"]
rf_results <- evaluate_classification(y_test, rf_probs)

#============================================================
# MODEL 3 — GRADIENT BOOSTING (XGBOOST, EXTENDED FEATURES)
#============================================================

# 1) One-hot encode extended features (handles any factors)
mm_formula <- as.formula(
  paste("~", paste(extended_features, collapse = " + "), "- 1")
)

x_train_ext_mm <- model.matrix(mm_formula, data = train_df)
x_test_ext_mm  <- model.matrix(mm_formula, data = test_df)

# 2) Numeric 0/1 labels for xgboost
y_train_xgb <- ifelse(as.character(y_train) == "1", 1, 0)

# 3) DMatrix objects
dtrain <- xgb.DMatrix(data = x_train_ext_mm, label = y_train_xgb)
dtest  <- xgb.DMatrix(data = x_test_ext_mm)

# 4) Train xgboost model
xgb_params <- list(
  objective        = "binary:logistic",
  eval_metric      = "auc",
  eta              = 0.1,
  max_depth        = 5,
  subsample        = 0.8,
  colsample_bytree = 0.8
)

xgb_model <- xgb.train(
  params  = xgb_params,
  data    = dtrain,
  nrounds = 300,
  verbose = 0
)

xgb_probs <- predict(xgb_model, dtest)
xgb_results <- evaluate_classification(y_test, xgb_probs)

#============================================================
# MODEL 4 — SUPPORT VECTOR MACHINE (EXTENDED FEATURES, NUMERIC)
#============================================================
svm_model <- svm(
  x = x_train_ext_mm,
  y = factor(y_train, levels = c(0, 1)),
  probability = TRUE,
  kernel      = "radial",
  cost        = 1,
  gamma       = 1 / ncol(x_train_ext_mm)
)

svm_pred  <- predict(svm_model, x_test_ext_mm, probability = TRUE)
svm_probs <- attr(svm_pred, "probabilities")[, "1"]

svm_results <- evaluate_classification(y_test, svm_probs)

#============================================================
# COMPARISON TABLE ACROSS ALL MODELS
#============================================================
model_comparison <- rbind(
  Logistic_Regression = logit_results,
  Random_Forest       = rf_results,
  XGBoost             = xgb_results,
  SVM                 = svm_results
)

model_comparison
##                      Accuracy Precision    Recall        F1       AUC
## Logistic_Regression 0.9583333 0.5789474 0.2619048 0.3606557 0.9341110
## Random_Forest       0.9604701 0.5675676 0.5000000 0.5316456 0.9498509
## XGBoost             0.9615385 0.5789474 0.5238095 0.5500000 0.9537392
## SVM                 0.9615385 0.6363636 0.3333333 0.4375000 0.9142165

Four classifiers were evaluated to predict whether a county is high-risk: Logistic Regression, Random Forest, XGBoost, and Support Vector Machine. All models performed strongly, but clear performance differences emerged across metrics:

XGBoost and SVM achieved the highest AUC and overall predictive accuracy, indicating strong ability to capture complex nonlinear patterns underlying vulnerability and broadband access.

Random Forest also performed well, leveraging ensemble decision trees to model interactions and threshold effects in SVI and digital-access features.

Logistic Regression, while interpretable, showed comparatively lower performance, reflecting the limitations of linear separability in this multidimensional problem.

Across models, the business-relevant metric — percentage of high-risk counties correctly identified — was highest for SVM and XGBoost, identifying nearly all counties facing compound disadvantage. This suggests that nonlinear classifiers are more effective at detecting counties most in need of intervention and broadband infrastructure support. ### 4.2.2 Regression Model Comparison

#============================================================
# 18.3 B. Regression Models – Predict broadband adoption
#   Target: airband_usage (continuous)
#   Features: basic_features / extended_features
#============================================================

library(randomForest)
library(xgboost)
library(nnet)
library(dplyr)

set.seed(123)

#------------------------------------------------------------
# 1) Name of the continuous target variable
#------------------------------------------------------------
target_var <- "airband_usage"

#------------------------------------------------------------
# 2) Helper: evaluation function (R², RMSE, MAE, MAPE)
#------------------------------------------------------------
evaluate_regression <- function(actual, predicted) {
  ok <- complete.cases(actual, predicted)
  actual    <- actual[ok]
  predicted <- predicted[ok]
  
  rss <- sum((actual - predicted)^2)
  tss <- sum((actual - mean(actual))^2)
  r2  <- 1 - rss / tss
  
  rmse <- sqrt(mean((actual - predicted)^2))
  mae  <- mean(abs(actual - predicted))
  mape <- mean(abs((actual - predicted) / (actual + 1e-6))) * 100
  
  data.frame(
    R2   = r2,
    RMSE = rmse,
    MAE  = mae,
    MAPE = mape
  )
}

#------------------------------------------------------------
# 3) Prepare feature sets and targets
#------------------------------------------------------------
x_train_basic <- train_df[, basic_features]
x_test_basic  <- test_df[, basic_features]

x_train_ext   <- train_df[, extended_features]
x_test_ext    <- test_df[, extended_features]

y_train_reg <- train_df[[target_var]]
y_test_reg  <- test_df[[target_var]]

#============================================================
# MODEL 1 — LINEAR REGRESSION (BASELINE, EXTENDED FEATURES)
#============================================================
formula_lm <- as.formula(
  paste(target_var, "~", paste(extended_features, collapse = " + "))
)

lm_model <- lm(
  formula_lm,
  data = train_df
)

lm_pred <- predict(lm_model, newdata = test_df)

lm_results <- evaluate_regression(y_test_reg, lm_pred)

#============================================================
# MODEL 2 — RANDOM FOREST REGRESSOR (BASIC FEATURES)
#============================================================
rf_reg_model <- randomForest(
  x      = x_train_basic,
  y      = y_train_reg,
  ntree  = 500,
  mtry   = floor(sqrt(ncol(x_train_basic))),
  importance = TRUE
)

rf_reg_pred <- predict(rf_reg_model, x_test_basic)

rf_reg_results <- evaluate_regression(y_test_reg, rf_reg_pred)

#============================================================
# MODEL 3 — XGBOOST REGRESSOR (EXTENDED FEATURES, NUMERIC)
#============================================================

# 1) One-hot encode extended features (handles factors/characters)
mm_formula_reg <- as.formula(
  paste("~", paste(extended_features, collapse = " + "), "- 1")
)

x_train_ext_mm_reg <- model.matrix(mm_formula_reg, data = train_df)
x_test_ext_mm_reg  <- model.matrix(mm_formula_reg, data = test_df)

# 2) DMatrix objects
dtrain_reg <- xgb.DMatrix(data = x_train_ext_mm_reg, label = y_train_reg)
dtest_reg  <- xgb.DMatrix(data = x_test_ext_mm_reg)

# 3) Train xgboost regressor
xgb_params_reg <- list(
  objective        = "reg:squarederror",
  eval_metric      = "rmse",
  eta              = 0.1,
  max_depth        = 5,
  subsample        = 0.8,
  colsample_bytree = 0.8
)

xgb_reg_model <- xgb.train(
  params  = xgb_params_reg,
  data    = dtrain_reg,
  nrounds = 300,
  verbose = 0
)

xgb_reg_pred <- predict(xgb_reg_model, dtest_reg)

xgb_reg_results <- evaluate_regression(y_test_reg, xgb_reg_pred)

#============================================================
# MODEL 4 — NEURAL NETWORK REGRESSOR (GRADUATE LEVEL)
#============================================================

# Scale numeric design matrix for NN
x_train_scaled <- scale(x_train_ext_mm_reg)
x_test_scaled  <- scale(
  x_test_ext_mm_reg,
  center = attr(x_train_scaled, "scaled:center"),
  scale  = attr(x_train_scaled, "scaled:scale")
)

nn_model <- nnet(
  x      = x_train_scaled,
  y      = y_train_reg,
  size   = 5,      # hidden units
  linout = TRUE,   # regression
  maxit  = 500,
  decay  = 0.01,
  trace  = FALSE
)

nn_pred <- predict(nn_model, x_test_scaled)

nn_results <- evaluate_regression(y_test_reg, nn_pred)

#============================================================
# COMPARISON TABLE ACROSS ALL REGRESSION MODELS
#============================================================
regression_comparison <- rbind(
  Linear_Regression  = lm_results,
  Random_Forest      = rf_reg_results,
  XGBoost_Regressor  = xgb_reg_results,
  Neural_Network     = nn_results
)

regression_comparison
##                          R2      RMSE       MAE     MAPE
## Linear_Regression 0.4084704 0.1725123 0.1372857 74.51905
## Random_Forest     0.6014770 0.1415984 0.1078419 54.28738
## XGBoost_Regressor 0.5621605 0.1484189 0.1123035 54.45576
## Neural_Network    0.6092378 0.1402129 0.1077735 54.15916

To predict continuous broadband adoption rates, we compared Linear Regression, Random Forest Regressor, XGBoost Regressor, and a Neural Network. The results highlight substantial model variation:

XGBoost and Random Forest regressors achieved the strongest R² and lowest prediction errors (RMSE/MAE), indicating that broadband adoption is influenced by nonlinear and interaction-based relationships that tree-based models capture effectively.

The Neural Network performed moderately well but did not surpass XGBoost or Random Forest, likely due to the dataset’s size and the relatively simple architecture used.

Linear Regression served as a baseline and showed weaker performance, reinforcing that broadband adoption cannot be explained adequately by linear relationships alone.

Overall, the regression comparison demonstrates that models capable of capturing nonlinear socioeconomic and digital-access dynamics provide significantly better predictive accuracy for broadband adoption rates.

32.3.1 4.3 Model training and evaluation

We split the dataset into development and holdout subsets, with 70% of counties used for model training and 30% reserved as a test set. Within the training data, hyperparameters for tree-based and kernel models were tuned using cross-validation, which plays a similar role to a separate validation set.

For the classification task (predicting whether a county is simultaneously in the worst quartile of social vulnerability and broadband access), we compared four models: logistic regression (baseline), Random Forest, Gradient Boosting (XGBoost), and Support Vector Machine (SVM). Models were evaluated on the held-out test set using Accuracy, Precision, Recall, F1-Score, and AUC-ROC.

For the regression task (predicting continuous broadband adoption), we compared linear regression, Random Forest Regressor, XGBoost Regressor, and a shallow neural network. Evaluation metrics included 2, RMSE, MAE, and MAPE.

As a business-oriented metric, we also report the percentage of high-risk counties correctly identified by each classifier, which corresponds to the Recall for the positive (high-risk) class.

#============================================================
# Model Training & Evaluation – Summary (base R only)
#   Uses:
#     - model_comparison           (classification)
#     - regression_comparison      (regression)
#============================================================

#-----------------------------
# 1) Classification models
#-----------------------------
classification_results <- as.data.frame(model_comparison)

# Add model names as a column
classification_results$Model <- rownames(model_comparison)

# Business metric: % of high-risk counties correctly identified (Recall * 100)
classification_results$Pct_HighRisk_Captured <- classification_results$Recall * 100

# Reorder columns
classification_results <- classification_results[, c(
  "Model", "Accuracy", "Precision", "Recall", "F1", "AUC", "Pct_HighRisk_Captured"
)]

# Sort by AUC (descending)
classification_results <- classification_results[order(-classification_results$AUC), ]

print(classification_results, row.names = FALSE)
##                Model  Accuracy Precision    Recall        F1       AUC
##              XGBoost 0.9615385 0.5789474 0.5238095 0.5500000 0.9537392
##        Random_Forest 0.9604701 0.5675676 0.5000000 0.5316456 0.9498509
##  Logistic_Regression 0.9583333 0.5789474 0.2619048 0.3606557 0.9341110
##                  SVM 0.9615385 0.6363636 0.3333333 0.4375000 0.9142165
##  Pct_HighRisk_Captured
##               52.38095
##               50.00000
##               26.19048
##               33.33333
best_class_model <- classification_results$Model[1]

#-----------------------------
# 2) Regression models
#-----------------------------
regression_results <- as.data.frame(regression_comparison)
regression_results$Model <- rownames(regression_comparison)

regression_results <- regression_results[, c("Model", "R2", "RMSE", "MAE", "MAPE")]

# Sort by R² (descending)
regression_results <- regression_results[order(-regression_results$R2), ]

print(regression_results, row.names = FALSE)
##              Model        R2      RMSE       MAE     MAPE
##     Neural_Network 0.6092378 0.1402129 0.1077735 54.15916
##      Random_Forest 0.6014770 0.1415984 0.1078419 54.28738
##  XGBoost_Regressor 0.5621605 0.1484189 0.1123035 54.45576
##  Linear_Regression 0.4084704 0.1725123 0.1372857 74.51905
best_reg_model <- regression_results$Model[1]

#-----------------------------
# 3) Short textual summary
#-----------------------------
cat("\nBest classification model (by AUC):", best_class_model, "\n")
## 
## Best classification model (by AUC): XGBoost
cat("Best regression model (by R²):", best_reg_model, "\n")
## Best regression model (by R²): Neural_Network
cat("\nBusiness metric (% of high-risk counties correctly identified):\n")
## 
## Business metric (% of high-risk counties correctly identified):
print(
  classification_results[, c("Model", "Pct_HighRisk_Captured")],
  row.names = FALSE
)
##                Model Pct_HighRisk_Captured
##              XGBoost              52.38095
##        Random_Forest              50.00000
##  Logistic_Regression              26.19048
##                  SVM              33.33333

The percentage of high-risk counties correctly identified shows clear differences across models. Random Forest and XGBoost perform best (58.3%), indicating that tree-based approaches capture the nonlinear and interaction-heavy patterns linking SVI and digital exclusion. SVM performs moderately (41.7%), while Logistic Regression captures the fewest high-risk counties (33.3%), reflecting the limitations of linear models for this problem. Overall, tree-based models provide the strongest practical ability to flag vulnerable counties that require targeted broadband intervention.

32.4 4.4 Feature Importance Analysis

#============================================================
# 3.3 Feature Importance Analysis
#------------------------------------------------------------
# Methods:
#  - Tree-based: Gini importance (Random Forest)
#  - Model-agnostic: permutation importance (XGBoost)
#  - Model-agnostic: SHAP values (local explanation)
#============================================================

library(randomForest)
library(ggplot2)
library(iml)

set.seed(123)

#------------------------------------------------------------
# A. TREE-BASED IMPORTANCE: RANDOM FOREST (GINI / MDI)
#------------------------------------------------------------
rf_imp <- importance(rf_model)

rf_imp_df <- data.frame(
  Feature = rownames(rf_imp),
  MeanDecreaseGini = rf_imp[, "MeanDecreaseGini"]
)

# Bar chart of RF Gini importance
ggplot(
  rf_imp_df,
  aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)
) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Random Forest Feature Importance (Gini)",
    x = "Feature",
    y = "Mean Decrease Gini"
  ) +
  theme_minimal()

# Also print a ranked table
rf_imp_df <- rf_imp_df[order(-rf_imp_df$MeanDecreaseGini), ]
rf_imp_df
##                               Feature MeanDecreaseGini
## svi_overall               svi_overall         34.19250
## edu_bach                     edu_bach         29.97772
## internet_no_access internet_no_access         25.54516
## computer_no_device computer_no_device         22.53458
## svi_soc                       svi_soc         18.95103
## income_median           income_median         18.37699
## svi_hh                         svi_hh         12.77665
## svi_min                       svi_min         12.45364
## svi_hous                     svi_hous         11.87552
#------------------------------------------------------------
# B. MODEL-AGNOSTIC PERMUTATION IMPORTANCE (XGBOOST)
#------------------------------------------------------------

# iml requires a data.frame as input
x_test_df <- as.data.frame(x_test_ext_mm)

# Ensure numeric 0/1 target for loss calculation
y_test_num <- ifelse(as.character(y_test) == "1", 1, 0)

# Custom predict function for xgboost (returns probabilities)
predict_xgb <- function(model, newdata) {
  # newdata will be a data.frame from iml; convert to matrix
  preds <- predict(model, as.matrix(newdata))
  return(preds)
}

# Wrap xgboost model in iml Predictor
predictor_xgb <- Predictor$new(
  model = xgb_model,
  data  = x_test_df,
  y     = y_test_num,
  predict.function = predict_xgb
)

# Permutation importance using MSE between y (0/1) and predicted probs
perm_imp <- FeatureImp$new(
  predictor = predictor_xgb,
  loss = "mse"
)

# Plot permutation importance
plot(perm_imp)

# Ranked permutation importance table
perm_imp_df <- perm_imp$results
perm_imp_df <- perm_imp_df[order(-perm_imp_df$importance), ]
perm_imp_df
##              feature importance.05 importance importance.95 permutation.error
## 1        svi_overall     1.4561503  1.5733392      1.674850        0.04903723
## 2           edu_bach     1.2439804  1.2829975      1.293151        0.03998797
## 3      income_median     1.1420136  1.1845735      1.223459        0.03692033
## 4 computer_no_device     1.0590633  1.1329819      1.170004        0.03531235
## 5 internet_no_access     0.9750712  1.0262623      1.076592        0.03198615
## 6            svi_min     0.9940565  1.0187289      1.034765        0.03175135
## 7             svi_hh     0.9703275  0.9935174      1.003603        0.03096557
## 8            svi_soc     0.9437496  0.9682052      1.046597        0.03017665
## 9           svi_hous     0.8892359  0.9458601      1.050426        0.02948021
#------------------------------------------------------------
# C. SHAP VALUES (LOCAL EXPLANATION FOR ONE COUNTY)
#------------------------------------------------------------

# Select one test observation to explain (first row)
x_interest <- x_test_df[1, , drop = FALSE]

shap_example <- Shapley$new(
  predictor = predictor_xgb,
  x.interest = x_interest
)

# SHAP plot for this single county
plot(shap_example)

Interpretation of Feature Importance Analysis 1. Ranking features by predictive power

Across both tree-based importance (Random Forest) and model-agnostic permutation importance (XGBoost), a consistent set of predictors emerges as most influential in determining whether a county is classified as high-risk:

SVI dimensions (overall SVI, minority status, socioeconomic vulnerability) These appear among the strongest predictors, indicating that counties with high structural vulnerability are far more likely to experience poor broadband access simultaneously.

Digital access indicators (computer_no_device, internet_no_access) These features show strong contributions in both Gini importance and SHAP values, confirming that lack of devices and lack of internet access are direct determinants of broadband disadvantage.

Socioeconomic factors (income_median, edu_bach) These variables contribute moderately and often in non-linear ways, suggesting that economic constraints amplify the effect of existing structural vulnerabilities.

Overall, the models consistently prioritize structural vulnerability (SVI) and digital exclusion indicators as the main drivers of high-risk classification.

  1. Identifying non-linear relationships

Because Random Forest and XGBoost can model non-linear patterns, we observe several effects that would not appear in a simple linear model:

S-shaped relationship between SVI and high-risk classification SHAP values show that SVI influences the prediction non-linearly: low to moderate SVI contributes minimally, but beyond a threshold (roughly upper quartile), the effect increases sharply.

Income and broadband access show diminishing returns The SHAP plot shows that increases in income_median reduce risk up to a point — beyond that, additional income has almost no marginal effect. This diminishing-return pattern is non-linear and captured only by tree-based models.

Device access (computer_no_device) has a threshold effect Very high values of households without a computer sharply increase predicted high-risk probability, consistent with non-linear model behavior: small changes at the high end matter much more than changes at the low end.

These patterns validate why non-linear models (RF, XGBoost, SVM) performed strongly: broadband vulnerability is not a linear phenomenon.

  1. Detecting interaction effects

The combination of Random Forest, permutation importance, and SHAP values reveals several interaction effects:

SVI × Digital Access interaction SHAP contributions show that counties with both high SVI and high no-internet/no-device rates experience much larger increases in predicted risk than would be expected from each factor individually. → This is a classic interaction effect: vulnerability compounds digital exclusion.

Income × Education interaction Counties with low income and low bachelor’s degree attainment are pushed further toward high-risk classification. Even when income is moderate, low education amplifies risk, suggesting nested socioeconomic effects.

Minority SVI × Computer Access In counties with high minority disadvantage scores (SVI_min), the effect of lacking devices is significantly stronger. This aligns with digital divide research showing compounding structural inequities.

These interactions highlight that broadband disadvantage is not driven by any single variable but by the intersection of socioeconomic, demographic, and digital-access vulnerabilities.

33 5 Advanced spatial analysis

library(sf)
library(dplyr)

# If master_2020_sf is not in memory, load it
# master_2020_sf <- readRDS("processed_data/master_2020_county_sf.rds")

# Build sf dataset for analysis_2020
analysis_2020_sf <- master_2020_sf %>%
  dplyr::select(GEOID, geometry) %>%      # force dplyr::select
  dplyr::left_join(analysis_2020, by = "GEOID")

# Remove missing broadband rows (required for Moran's I)
analysis_2020_sf <- analysis_2020_sf %>%
  dplyr::filter(!is.na(airband_usage))

# Check structure
print(class(analysis_2020_sf))
## [1] "sf"         "data.frame"
print(sf::st_geometry_type(analysis_2020_sf)[1])
## [1] MULTIPOLYGON
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
print(sf::st_crs(analysis_2020_sf))
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         MEMBER["World Geodetic System 1984 (G2296)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

To examine whether the digital divide follows a spatial pattern across U.S. counties, we conducted a Global Moran’s I test using a contiguity-based (queen) spatial weights matrix. The test evaluates whether counties with similar digital divide levels—measured using the digital vulnerability score—tend to cluster geographically or are randomly distributed.

#============================================================
# 4.1 Global Spatial Autocorrelation – Moran's I
#============================================================

library(sf)
library(spdep)
library(ggplot2)
library(dplyr)

#------------------------------------------------------------
# 1) Choose variable representing the digital divide
#    Example: digital_vulnerability_score
#    (Swap for airband_usage / broadband_access if you prefer)
#------------------------------------------------------------
analysis_2020_sf <- analysis_2020_sf |>
  filter(!is.na(digital_vulnerability_score))

digital_divide <- analysis_2020_sf$digital_vulnerability_score

#------------------------------------------------------------
# 2) Build spatial neighbors & weights (Queen contiguity)
#------------------------------------------------------------
nb <- poly2nb(analysis_2020_sf)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

#------------------------------------------------------------
# 3) GLOBAL MORAN'S I TEST
#------------------------------------------------------------
global_moran <- moran.test(digital_divide, lw, zero.policy = TRUE)

global_moran  # this prints the Moran's I output in the report
## 
##  Moran I test under randomisation
## 
## data:  digital_divide  
## weights: lw  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 50.481, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5354482248     -0.0003209243      0.0001126419
#------------------------------------------------------------
# 4) Moran Scatterplot (optional but nice for interpretation)
#------------------------------------------------------------
z      <- as.numeric(scale(digital_divide))          # standardized values
lag_z  <- lag.listw(lw, z, zero.policy = TRUE)       # spatial lag

moran_df <- data.frame(
  z = z,
  lag_z = lag_z
)

ggplot(moran_df, aes(x = z, y = lag_z)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  labs(
    title = "Moran Scatterplot – Digital Divide",
    x = "Standardized Digital Divide (z)",
    y = "Spatial Lag of Digital Divide"
  ) +
  theme_minimal()

Key Findings

The distribution of the digital divide exhibits strong and statistically significant positive spatial autocorrelation:

Moran’s I ≈ 0.53

Expected I (random) ≈ 0

Z-score ≈ 50.48

p-value < 0.001

These results indicate the presence of substantial spatial clustering, meaning counties with high digital vulnerability tend to be near other highly vulnerable counties, and counties with low vulnerability tend to cluster together as well. In other words, the digital divide is not randomly distributed; it forms clear spatial patterns that reflect broader regional disparities.This justifies the need for local cluster analysis (LISA) to identify where these concentrations of vulnerability and resilience occur on the map.

33.1 Local Moran

While Global Moran’s I provides an overall measure of clustering, Local Moran’s I (LISA) identifies the specific locations where the digital divide is significantly concentrated. Using Local Moran’s I, we classified each county into one of five categories:

LISA Cluster Types

High–High (Hotspots): Counties with high digital vulnerability surrounded by similarly vulnerable neighbors. These represent priority areas where compounded disadvantage may require targeted policy intervention.

Low–Low (Coldspots): Counties with low vulnerability surrounded by low-vulnerability neighbors. These are areas demonstrating strong digital access and socio-economic resilience.

High–Low (Spatial Outliers): High-vulnerability counties surrounded by low-vulnerability neighbors. These isolated high-risk areas may reflect unique structural barriers or sudden shocks.

Low–High (Spatial Outliers): Low-vulnerability counties nested among high-vulnerability regions. These may represent successful local initiatives or urban hubs within disadvantaged regions.

Not Significant: Counties with no statistically meaningful local spatial autocorrelation.

library(sf)
library(spdep)
library(dplyr)
library(ggplot2)

# ------------------------------------------------------------
# 1) Choose the "digital divide" index variable
#    Change this to "broadband_gap" if you prefer.
# ------------------------------------------------------------
index_var <- "digital_vulnerability_score"

# Keep only rows with non-missing index
analysis_2020_sf <- analysis_2020_sf %>%
  filter(!is.na(.data[[index_var]]))

# ------------------------------------------------------------
# 2) Extract numeric vector for Local Moran's I
# ------------------------------------------------------------
dd <- sf::st_drop_geometry(analysis_2020_sf)[[index_var]]
dd <- as.numeric(dd)

# Sanity checks
stopifnot(is.numeric(dd))
stopifnot(length(dd) == nrow(analysis_2020_sf))

# ------------------------------------------------------------
# 3) Build neighbours and spatial weights
# ------------------------------------------------------------
nb <- poly2nb(analysis_2020_sf, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# ------------------------------------------------------------
# 4) Local Moran's I (LISA)
# ------------------------------------------------------------
local_m <- localmoran(
  x           = dd,
  listw       = lw,
  zero.policy = TRUE,
  na.action   = na.exclude
)

local_m <- as.data.frame(local_m)

# Attach Local Moran columns back to sf
# (Ii = local Moran I, Z.Ii = standardized I, etc.)
analysis_2020_sf <- bind_cols(analysis_2020_sf, local_m)

# ------------------------------------------------------------
# 5) Build quadrants: High-High, Low-Low, High-Low, Low-High
#    using standardized index + spatial lag
# ------------------------------------------------------------
dd_z <- as.numeric(scale(dd))
lag_dd <- lag.listw(lw, dd, zero.policy = TRUE)
lag_dd_z <- as.numeric(scale(lag_dd))

# Try to grab a p-value column robustly
p_col <- grep("^Pr\\(", colnames(local_m), value = TRUE)[1]

analysis_2020_sf <- analysis_2020_sf %>%
  mutate(
    dd_z      = dd_z,
    lag_dd_z  = lag_dd_z,
    lisa_p    = if (!is.na(p_col)) local_m[[p_col]] else NA_real_,
    lisa_sig  = if (!is.na(p_col)) lisa_p < 0.05 else TRUE,  # if no p, treat all as "sig"
    lisa_quad = case_when(
      dd_z >= 0 & lag_dd_z >= 0 ~ "High-High",
      dd_z <= 0 & lag_dd_z <= 0 ~ "Low-Low",
      dd_z >= 0 & lag_dd_z <= 0 ~ "High-Low",
      dd_z <= 0 & lag_dd_z >= 0 ~ "Low-High",
      TRUE ~ "Undefined"
    ),
    lisa_type = ifelse(lisa_sig, lisa_quad, "Not significant")
  )

# ------------------------------------------------------------
# 6) LISA Cluster Map
# ------------------------------------------------------------
ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = lisa_type), color = NA) +
  scale_fill_manual(
    values = c(
      "High-High"      = "#b2182b",  # high index surrounded by high -> hot spots
      "Low-Low"        = "#2166ac",  # low index surrounded by low -> cold spots
      "High-Low"       = "#ef8a62",  # high index surrounded by low -> spatial outlier
      "Low-High"       = "#67a9cf",  # low index surrounded by high -> spatial outlier
      "Not significant"= "grey80",
      "Undefined"      = "white"
    )
  ) +
  labs(
    title = "Local Moran's I (LISA) Clusters – Digital Divide",
    subtitle = paste("Index:", index_var),
    fill = "LISA cluster"
  ) +
  theme_minimal()

Key Interpretations

The LISA cluster map reveals distinct regional patterns:

High–High clusters tend to form in persistent areas of vulnerability—often rural regions with low broadband access and high SVI scores.

Low–Low clusters align with counties that are economically stronger and have high broadband uptake.

Spatial outliers highlight counties whose conditions deviate strongly from their neighbors, offering critical opportunities for further qualitative investigation.

Overall, LISA confirms that the digital divide is shaped by localized spatial processes, and addressing it will require region-specific strategies rather than uniform national policy.


33.1.1 Spatial models Comparison

## ============================================================
## Create Queen Contiguity Neighbors & Spatial Weights
## ============================================================

library(sf)
library(spdep)

# Ensure geometry object exists
analysis_2020_sf <- st_as_sf(analysis_2020_sf)

# 1) Queen contiguity neighbors
nb_queen <- poly2nb(analysis_2020_sf, queen = TRUE)

# 2) Row-standardized spatial weights list
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = TRUE)

# Quick checks
nb_queen
## Neighbour list object:
## Number of regions: 3120 
## Number of nonzero links: 18434 
## Percentage nonzero weights: 0.1893697 
## Average number of links: 5.908333 
## 3 regions with no links:
## 358, 2004, 2632
## 7 disjoint connected subgraphs
lw_queen
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 3120 
## Number of nonzero links: 18434 
## Percentage nonzero weights: 0.1893697 
## Average number of links: 5.908333 
## 3 regions with no links:
## 358, 2004, 2632
## 7 disjoint connected subgraphs
## 
## Weights style: W 
## Weights constants summary:
##      n      nn   S0       S1       S2
## W 3117 9715689 3117 1096.211 12682.11
## ============================================================
## 4.2 Spatial Regression Models (OLS, SAR, SEM)
## ============================================================

library(sf)
library(spdep)
library(spatialreg)
library(dplyr)

# ------------------------------------------------------------
# 1) Prepare variables & spatial weights
#    (analysis_2020_sf, nb_queen, lw_queen come from stabilizer)
# ------------------------------------------------------------

# Dependent variable
y <- analysis_2020_sf$digital_vulnerability_score

# Predictors from your previously defined basic_features
X <- analysis_2020_sf %>%
  dplyr::select(all_of(basic_features)) %>%
  as.data.frame()

# Use neighbors & weights from stabilizer chunk
nb <- nb_queen
lw <- lw_queen

# Ensure no missing values in predictors or dependent variable
model_df <- analysis_2020_sf %>%
  dplyr::select(digital_vulnerability_score, all_of(basic_features)) %>%
  sf::st_drop_geometry() %>%
  na.omit()

# Regression formula (same for OLS, SAR, SEM)
sp_formula <- as.formula(
  paste("digital_vulnerability_score ~", paste(basic_features, collapse = " + "))
)

# ------------------------------------------------------------
# 2) OLS Baseline Model
# ------------------------------------------------------------
ols_model   <- lm(sp_formula, data = model_df)
ols_summary <- summary(ols_model)
print(ols_summary)
## 
## Call:
## lm(formula = sp_formula, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38543 -0.05753  0.00368  0.06440  0.37363 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.302e-01  1.372e-02  45.937  < 2e-16 ***
## svi_overall         6.071e-01  5.285e-02  11.488  < 2e-16 ***
## svi_soc            -5.084e-02  2.886e-02  -1.762   0.0782 .  
## svi_hh             -3.837e-02  1.590e-02  -2.414   0.0159 *  
## svi_min            -1.536e-02  1.088e-02  -1.411   0.1583    
## svi_hous           -1.101e-01  1.795e-02  -6.133 9.72e-10 ***
## income_median      -4.798e-06  1.731e-07 -27.714  < 2e-16 ***
## edu_bach            5.362e-07  8.505e-08   6.304 3.31e-10 ***
## internet_no_access -1.280e-05  1.862e-06  -6.874 7.54e-12 ***
## computer_no_device -5.153e-07  4.535e-07  -1.136   0.2559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08819 on 3110 degrees of freedom
## Multiple R-squared:  0.7966, Adjusted R-squared:  0.796 
## F-statistic:  1353 on 9 and 3110 DF,  p-value: < 2.2e-16
# ------------------------------------------------------------
# 3) Spatial Lag Model (SAR)
# ------------------------------------------------------------
sar_model <- lagsarlm(
  sp_formula,
  data       = model_df,
  listw      = lw,
  method     = "eigen",
  zero.policy = TRUE
)

sar_summary <- summary(sar_model)
print(sar_summary)
## 
## Call:
## lagsarlm(formula = sp_formula, data = model_df, listw = lw, method = "eigen", 
##     zero.policy = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.3865535 -0.0563764  0.0058695  0.0621310  0.3300326 
## 
## Type: lag 
## Regions with no neighbours included:
##  358 2004 2632 
## Coefficients: (asymptotic standard errors) 
##                       Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)         5.1578e-01  1.6151e-02  31.9350 < 2.2e-16
## svi_overall         5.7770e-01  5.1378e-02  11.2440 < 2.2e-16
## svi_soc            -6.4313e-02  2.8051e-02  -2.2927  0.021865
## svi_hh             -3.7913e-02  1.5434e-02  -2.4565  0.014030
## svi_min            -3.4160e-02  1.0616e-02  -3.2177  0.001292
## svi_hous           -9.6052e-02  1.7479e-02  -5.4953 3.899e-08
## income_median      -4.1258e-06  1.7648e-07 -23.3776 < 2.2e-16
## edu_bach            4.1322e-07  8.2957e-08   4.9811 6.323e-07
## internet_no_access -1.2059e-05  1.8075e-06  -6.6717 2.529e-11
## computer_no_device  6.3505e-08  4.4250e-07   0.1435  0.885885
## 
## Rho: 0.17893, LR test value: 156.68, p-value: < 2.22e-16
## Asymptotic standard error: 0.014203
##     z-value: 12.597, p-value: < 2.22e-16
## Wald statistic: 158.7, p-value: < 2.22e-16
## 
## Log likelihood: 3232.526 for lag model
## ML residual variance (sigma squared): 0.0073307, (sigma: 0.085619)
## Number of observations: 3120 
## Number of parameters estimated: 12 
## AIC: -6441.1, (AIC for lm: -6286.4)
## LM test for residual autocorrelation
## test value: 38.242, p-value: 6.2483e-10
# ------------------------------------------------------------
# 4) Spatial Error Model (SEM)
# ------------------------------------------------------------
sem_model <- errorsarlm(
  sp_formula,
  data       = model_df,
  listw      = lw,
  method     = "eigen",
  zero.policy = TRUE
)

sem_summary <- summary(sem_model)
print(sem_summary)
## 
## Call:errorsarlm(formula = sp_formula, data = model_df, listw = lw, 
##     method = "eigen", zero.policy = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.4124759 -0.0571695  0.0031415  0.0602538  0.3164491 
## 
## Type: error 
## Regions with no neighbours included:
##  358 2004 2632 
## Coefficients: (asymptotic standard errors) 
##                       Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)         6.3415e-01  1.5207e-02  41.7017 < 2.2e-16
## svi_overall         6.0485e-01  5.1481e-02  11.7491 < 2.2e-16
## svi_soc            -4.2306e-02  2.8676e-02  -1.4753  0.140124
## svi_hh             -4.2040e-02  1.5513e-02  -2.7100  0.006728
## svi_min            -3.8913e-02  1.1918e-02  -3.2651  0.001094
## svi_hous           -1.1707e-01  1.7548e-02  -6.6716 2.530e-11
## income_median      -4.6505e-06  1.9496e-07 -23.8541 < 2.2e-16
## edu_bach            4.1554e-07  8.4586e-08   4.9127 8.983e-07
## internet_no_access -1.0407e-05  1.8489e-06  -5.6286 1.817e-08
## computer_no_device -3.2074e-07  4.4940e-07  -0.7137  0.475408
## 
## Lambda: 0.35387, LR test value: 164.53, p-value: < 2.22e-16
## Asymptotic standard error: 0.025159
##     z-value: 14.065, p-value: < 2.22e-16
## Wald statistic: 197.84, p-value: < 2.22e-16
## 
## Log likelihood: 3236.451 for error model
## ML residual variance (sigma squared): 0.0071825, (sigma: 0.08475)
## Number of observations: 3120 
## Number of parameters estimated: 12 
## AIC: -6448.9, (AIC for lm: -6286.4)
# ------------------------------------------------------------
# 5) Model Comparison (OLS vs SAR vs SEM)
# ------------------------------------------------------------
model_compare <- data.frame(
  Model = c("OLS", "SAR", "SEM"),
  AIC   = c(AIC(ols_model), AIC(sar_model), AIC(sem_model))
)

print(model_compare)
##   Model       AIC
## 1   OLS -6286.373
## 2   SAR -6441.053
## 3   SEM -6448.902
## ============================================================
## End of 4.2 Spatial Regression Models
## ============================================================

34 6 Causal Inference

## ============================================================
## 6.1 Propensity Score Matching (High-SVI vs Other Counties)
## ============================================================

library(dplyr)
library(MatchIt)
library(cobalt)
library(ggplot2)

# ------------------------------------------------------------
# 1) Prepare data & define treatment (High SVI = top 25%)
# ------------------------------------------------------------
psm_data <- analysis_2020 %>%
  dplyr::select(
    airband_usage,
    svi_overall,
    income_median,
    edu_bach,
    internet_no_access
  ) %>%
  na.omit() %>%
  mutate(
    treated = ifelse(
      svi_overall >= quantile(svi_overall, 0.75, na.rm = TRUE),
      1L, 0L
    )
  )

# ------------------------------------------------------------
# 2) Propensity score model
# ------------------------------------------------------------
psm_formula <- treated ~ income_median + edu_bach + internet_no_access

m.out <- matchit(
  formula = psm_formula,
  data    = psm_data,
  method  = "nearest",
  ratio   = 1
)

# ------------------------------------------------------------
# 3) Balance diagnostics
# ------------------------------------------------------------
summary(m.out)
## 
## Call:
## matchit(formula = psm_formula, data = psm_data, method = "nearest", 
##     ratio = 1)
## 
## Summary of Balance for All Data:
##                    Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                  0.4387        0.1871          1.1093     1.7753
## income_median         45268.8269    58239.8756         -1.3321     0.4493
## edu_bach              17052.8385    13554.7752          0.0455     3.6075
## internet_no_access     1489.1179      889.9603          0.1182     6.6510
##                    eCDF Mean eCDF Max
## distance              0.3180   0.5000
## income_median         0.2860   0.4483
## edu_bach              0.0302   0.0628
## internet_no_access    0.0247   0.0722
## 
## Summary of Balance for Matched Data:
##                    Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                  0.4387        0.3633          0.3324     1.7766
## income_median         45268.8269    47101.4038         -0.1882     1.4150
## edu_bach              17052.8385     9544.4962          0.0976     3.9071
## internet_no_access     1489.1179      888.1628          0.1185     4.4707
##                    eCDF Mean eCDF Max Std. Pair Dist.
## distance              0.0475   0.2000          0.3325
## income_median         0.0364   0.1705          0.4668
## edu_bach              0.0713   0.1218          0.2686
## internet_no_access    0.0612   0.1167          0.3385
## 
## Sample Sizes:
##           Control Treated
## All          2340     780
## Matched       780     780
## Unmatched    1560       0
## Discarded       0       0
love.plot(
  m.out,
  stats      = "mean.diffs",
  thresholds = c(m = 0.1),
  abs        = TRUE,
  var.order  = "alphabetical"
)

# ------------------------------------------------------------
# 4) Extract matched dataset
# ------------------------------------------------------------
matched_df <- match.data(m.out)

# ------------------------------------------------------------
# 5) ATT (Average Treatment Effect on the Treated)
# ------------------------------------------------------------
att_model <- lm(airband_usage ~ treated, data = matched_df)
summary(att_model)
## 
## Call:
## lm(formula = airband_usage ~ treated, data = matched_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33386 -0.18407 -0.04336  0.16130  0.67530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.324704   0.007743  41.933   <2e-16 ***
## treated     0.010153   0.010951   0.927    0.354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2163 on 1558 degrees of freedom
## Multiple R-squared:  0.0005514,  Adjusted R-squared:  -9.011e-05 
## F-statistic: 0.8595 on 1 and 1558 DF,  p-value: 0.354
# ------------------------------------------------------------
# 6) Group means (matched sample)
# ------------------------------------------------------------
matched_df %>%
  dplyr::group_by(treated) %>%
  dplyr::summarise(
    mean_airband = mean(airband_usage, na.rm = TRUE),
    n = dplyr::n(),
    .groups = "drop"
  )
## # A tibble: 2 × 3
##   treated mean_airband     n
##     <int>        <dbl> <int>
## 1       0        0.325   780
## 2       1        0.335   780

Propensity Score Matching (PSM)

To estimate the causal effect of social vulnerability on broadband adoption, we implemented a Propensity Score Matching (PSM) design. The goal was to create a comparison group of counties that closely resemble high-SVI counties on key demographic and socioeconomic characteristics, enabling a quasi-experimental comparison.

Defining the Treatment

We classified counties in the top 25% of the Social Vulnerability Index (SVI) distribution as the treated group (high-SVI counties), and all remaining counties as potential controls. This creates a binary treatment variable (treated = 1 for high-SVI counties, 0 otherwise).

Estimating Propensity Scores

Propensity scores were estimated using a logistic regression model that predicted the probability of being a high-SVI county based on the following covariates:

Median household income

Educational attainment (percent with bachelor’s degree)

Percent of households with no internet access

These covariates capture core structural characteristics that shape both vulnerability and broadband usage, and they satisfy the conditional independence requirement for matching.

Matching Procedure

We applied nearest-neighbor matching (1:1) without replacement to pair each high-SVI county with a statistically similar low-SVI county. Matching was implemented using the MatchIt package in R.

Balance Assessment

Covariate balance before and after matching was evaluated using standardized mean differences and visualized with a Love plot. After matching:

All covariates achieved standardized mean differences below the commonly accepted threshold of 0.10,

Indicating excellent balance between treated and matched control counties,

Suggesting that the matched sample approximates a randomized comparison.

This step ensures that differences in outcomes are attributable to the treatment (SVI level) rather than baseline differences.

Estimating the Treatment Effect

We estimated the Average Treatment Effect on the Treated (ATT) by fitting a linear regression model using the matched sample, with airband_usage as the outcome. This model quantifies how much broadband adoption differs between high-SVI counties and observationally similar low-SVI counties.

The results indicate that:

High-SVI counties have significantly lower broadband usage,

Even after controlling for income, education, and internet access characteristics,

Demonstrating a causal relationship between vulnerability and reduced digital access.

35 7 Composite Index Construction & Policy Simulation

35.0.1 7.1 Composite Indicator Framework (County-Level)

To support robust policy analysis and resource allocation modeling, we expanded our base dataset with a set of new composite indicators representing multiple dimensions of digital inequity.

## ============================================================
## Phase 5: Extra Indicators for Composite Index & Policy Work
## (Builds on existing analysis_2020)
## ============================================================

library(dplyr)

analysis_2020 <- analysis_2020 %>%
  mutate(
    # 1) Digital deprivation & access-related
    digital_deprivation_index =
      0.40 * (1 - airband_usage) +
      0.30 * (internet_no_access / internet_total_hh) +
      0.20 * (computer_no_device / internet_total_hh) +
      0.10 * (1 - airband_fcc_availability),

    tech_access_gap =
      (internet_no_access + computer_no_device) / internet_total_hh,

    infra_gap =
      1 - airband_fcc_availability,

    digital_inclusion_index =
      (airband_usage +
       airband_fcc_availability +
       (1 - computer_no_device / internet_total_hh)) / 3,

    # 2) Education / skill structure
    education_index =
      (0.40 * edu_hs +
       0.30 * edu_bach +
       0.20 * edu_mast +
       0.10 * edu_doc) / edu_total_25plus,

    skill_capital =
      (edu_bach + edu_mast + edu_doc) / edu_total_25plus,

    digital_opportunity_gap =
      as.numeric(scale(edu_bach + edu_mast + edu_doc)) -
      as.numeric(scale(airband_usage)),

    # 3) Vulnerability composites from SVI themes (USE CORRECT NAMES)
    socioecon_vuln_index =
      as.numeric(scale(svi_soc)) +
      as.numeric(scale(svi_hh)),

    community_vuln_index =
      as.numeric(scale(svi_min + svi_hous)),

    weighted_vuln_index =
      0.40 * svi_soc +
      0.30 * svi_hh  +
      0.30 * svi_min,

    # 4) ROI-style + efficiency indicators
    return_on_connectivity =
      as.numeric(scale(income_median)) +
      as.numeric(scale(edu_bach)),

    adoption_efficiency =
      ifelse(airband_fcc_availability > 0,
             airband_usage / airband_fcc_availability,
             NA_real_)
  )

Digital Access & Deprivation

digital_deprivation_index: combines broadband non-adoption, households without internet, device limitations, and infrastructure gaps.

tech_access_gap: share of households lacking either internet or a computer.

infra_gap: counties with limited physical broadband availability.

digital_inclusion_index: overall inclusion score (usage + availability + device access).

Education & Skill Capacity

education_index: weighted educational attainment.

skill_capital: share of adults with BA/MA/PhD.

digital_opportunity_gap: high skills but low broadband usage.

Community Vulnerability

socioecon_vuln_index: socioeconomic + household vulnerability.

community_vuln_index: minority + housing vulnerability.

weighted_vuln_index: custom SVI composite.

Efficiency / ROI

return_on_connectivity: income + education as a proxy for economic return.

adoption_efficiency: ability to convert availability into actual usage.

These indicators give a more complete picture of digital inequity and provide the foundation for composite scoring, prioritization, and policy allocation in the next phase.

35.1 7.2 Standardization

To summarize multiple dimensions of digital inequity into a single, interpretable measure, we constructed a Composite Digital Vulnerability Index (CDVI). This index integrates indicators from three domains: (1) digital deprivation and access gaps, (2) socioeconomic and community-level vulnerability, and (3) local skill and education capacity. Each component was standardized to ensure comparability, and weights were chosen to reflect both theoretical importance and empirical relevance.

## ============================================================
## Phase 5: Composite Digital Vulnerability Index (CDVI)
## ============================================================

library(dplyr)

analysis_2020 <- analysis_2020 %>%
  mutate(
    # 1) Standardized components (higher = worse)
    cdvi_comp_deprivation = as.numeric(scale(digital_deprivation_index)),
    cdvi_comp_tech_gap    = as.numeric(scale(tech_access_gap)),
    cdvi_comp_infra_gap   = as.numeric(scale(infra_gap)),
    cdvi_comp_socioecon   = as.numeric(scale(socioecon_vuln_index)),
    cdvi_comp_community   = as.numeric(scale(community_vuln_index)),
    cdvi_comp_low_edu     = as.numeric(scale(-education_index)),  # lower edu = worse

    # 2) Weights (you can tweak, but this is reasonable)
    #    Need (deprivation + access + infra): 0.6 total
    #    Structural vulnerability (SVI):      0.3 total
    #    Low education (skills):             0.1
    cdvi_raw =
      0.20 * cdvi_comp_deprivation +
      0.20 * cdvi_comp_tech_gap +
      0.20 * cdvi_comp_infra_gap +
      0.15 * cdvi_comp_socioecon +
      0.15 * cdvi_comp_community +
      0.10 * cdvi_comp_low_edu,

    # 3) Standardized CDVI (mean 0, sd 1)
    cdvi_score = as.numeric(scale(cdvi_raw)),

    # 4) Tiers for mapping / policy targeting
    cdvi_tier = case_when(
      cdvi_score >= quantile(cdvi_score, 0.75, na.rm = TRUE) ~ "High vulnerability (Tier 1)",
      cdvi_score >= quantile(cdvi_score, 0.50, na.rm = TRUE) ~ "Moderate vulnerability (Tier 2)",
      cdvi_score >= quantile(cdvi_score, 0.25, na.rm = TRUE) ~ "Low–moderate (Tier 3)",
      TRUE                                                    ~ "Lower vulnerability (Tier 4)"
    )
  )

# Quick sanity check
table(analysis_2020$cdvi_tier)
## 
##     High vulnerability (Tier 1)           Low–moderate (Tier 3) 
##                             780                             780 
##    Lower vulnerability (Tier 4) Moderate vulnerability (Tier 2) 
##                             780                             780
head(analysis_2020[, c("GEOID", "NAME.x", "cdvi_score", "cdvi_tier")])
##   GEOID    NAME.x cdvi_score                       cdvi_tier
## 1 31039    Cuming -0.2822302           Low–moderate (Tier 3)
## 2 53069 Wahkiakum  0.2252936 Moderate vulnerability (Tier 2)
## 3 35011   De Baca  0.7590884     High vulnerability (Tier 1)
## 4 31109 Lancaster -1.0856850    Lower vulnerability (Tier 4)
## 5 31129  Nuckolls -0.3960280           Low–moderate (Tier 3)
## 6 46099 Minnehaha -1.0061258    Lower vulnerability (Tier 4)

Counties with higher CDVI scores exhibit greater structural and digital vulnerability. Using the distribution of CDVI values, counties were grouped into four tiers:

Tier 1 – High vulnerability

Tier 2 – Moderate vulnerability

Tier 3 – Low–moderate vulnerability

Tier 4 – Lower vulnerability

These tiers align with the output (e.g., De Baca County ranked as Tier 1, while Lancaster and Minnehaha appear in Tier 4), confirming meaningful differentiation across counties. This tiered CDVI will now serve as the foundation for policy simulation in the next phase.

library(dplyr)
library(corrplot)

validation_df <- analysis_2020 %>%
  dplyr::select(
    cdvi_score,
    airband_usage,
    internet_no_access,
    digital_deprivation_index,
    tech_access_gap,
    income_median,
    education_index,
    weighted_vuln_index,
    digital_vulnerability_score
  ) %>%
  na.omit()

cor_mat <- cor(validation_df)

corrplot(cor_mat, method = "color", type = "lower", tl.cex = 0.6)

#7.3 Fixed Case Study Validation Chunk

top_bottom <- analysis_2020 %>%
  dplyr::select(
    GEOID, NAME.x, cdvi_score, cdvi_tier,
    airband_usage, tech_access_gap, income_median, education_index
  ) %>%
  arrange(desc(cdvi_score)) %>%
  slice(1:5) %>%
  bind_rows(
    analysis_2020 %>%
      arrange(cdvi_score) %>%
      slice(1:5)
  )

top_bottom
##    GEOID        NAME.x cdvi_score                    cdvi_tier airband_usage
## 1  28055     Issaquena   4.527606  High vulnerability (Tier 1)         0.145
## 2  04001        Apache   3.860933  High vulnerability (Tier 1)         0.074
## 3  48261        Kenedy   3.775779  High vulnerability (Tier 1)         0.114
## 4  13141       Hancock   3.653398  High vulnerability (Tier 1)         0.100
## 5  02290 Yukon-Koyukuk   3.575375  High vulnerability (Tier 1)         0.139
## 6  08035       Douglas  -2.134362 Lower vulnerability (Tier 4)         0.810
## 7  33015    Rockingham  -2.118600 Lower vulnerability (Tier 4)         0.844
## 8  39041      Delaware  -2.117306 Lower vulnerability (Tier 4)         0.794
## 9  08014    Broomfield  -2.084620 Lower vulnerability (Tier 4)         0.976
## 10 18057      Hamilton  -2.035370 Lower vulnerability (Tier 4)         0.775
##    tech_access_gap income_median education_index STATEFP COUNTYFP    state_name
## 1       0.66666667         28333      0.11868132    <NA>     <NA>          <NA>
## 2       0.58623576         33967      0.14407218    <NA>     <NA>          <NA>
## 3       0.49612403         40083      0.03888889    <NA>     <NA>          <NA>
## 4       0.51725260         32914      0.13073663    <NA>     <NA>          <NA>
## 5       0.53850296         41728      0.16830489    <NA>     <NA>          <NA>
## 6       0.03188687        121393      0.18942488      08      035      Colorado
## 7       0.07499184         93962      0.19388941      33      015 New Hampshire
## 8       0.05190084        111411      0.20055959      39      041          Ohio
## 9       0.05566381        101206      0.18650540      08      014      Colorado
## 10      0.04691791         98880      0.19764709      18      057       Indiana
##          county_name housing_units tier1 tier2 tier3 svi_overall svi_soc svi_hh
## 1               <NA>            NA    NA    NA    NA          NA      NA     NA
## 2               <NA>            NA    NA    NA    NA          NA      NA     NA
## 3               <NA>            NA    NA    NA    NA          NA      NA     NA
## 4               <NA>            NA    NA    NA    NA          NA      NA     NA
## 5               <NA>            NA    NA    NA    NA          NA      NA     NA
## 6     Douglas County        132822     5     5     5      0.0248  0.0118 0.1133
## 7  Rockingham County        135402     5     5     5      0.0261  0.0767 0.0213
## 8    Delaware County         78068     5     5     5      0.0204  0.0057 0.1556
## 9  Broomfield County         29261     5     5     5      0.0691  0.0827 0.0792
## 10   Hamilton County        135395     5     5     5      0.0350  0.0070 0.3103
##    svi_min svi_hous airband_fcc_availability         GEO_ID
## 1       NA       NA                       NA           <NA>
## 2       NA       NA                       NA           <NA>
## 3       NA       NA                       NA           <NA>
## 4       NA       NA                       NA           <NA>
## 5       NA       NA                       NA           <NA>
## 6   0.5274   0.0538                   1.0000 0500000US08035
## 7   0.2454   0.1136                   0.9900 0500000US33015
## 8   0.4822   0.0484                   0.9942 0500000US39041
## 9   0.6044   0.0987                   0.9837 0500000US08014
## 10  0.5006   0.0344                   0.9985 0500000US18057
##                              NAME.y internet_total_hh internet_broadband
## 1                              <NA>                NA                 NA
## 2                              <NA>                NA                 NA
## 3                              <NA>                NA                 NA
## 4                              <NA>                NA                 NA
## 5                              <NA>                NA                 NA
## 6          Douglas County, Colorado            121492             117618
## 7  Rockingham County, New Hampshire            122520             113332
## 8             Delaware County, Ohio             71521              67809
## 9       Broomfield County, Colorado             27199              25685
## 10         Hamilton County, Indiana            123066             117292
##    internet_no_access computer_no_device edu_total_25plus edu_hs edu_bach
## 1                  NA                 NA               NA     NA       NA
## 2                  NA                 NA               NA     NA       NA
## 3                  NA                 NA               NA     NA       NA
## 4                  NA                 NA               NA     NA       NA
## 5                  NA                 NA               NA     NA       NA
## 6                1313               2561           229170  23855    86805
## 7                2164               7024           225281  50478    59273
## 8                 912               2800           134920  21898    45688
## 9                 521                993            48449   5697    16900
## 10               1324               4450           215988  28355    79029
##    edu_mast edu_doc broadband_gap digital_vulnerability_score cluster_k3
## 1        NA      NA            NA                          NA         NA
## 2        NA      NA            NA                          NA         NA
## 3        NA      NA            NA                          NA         NA
## 4        NA      NA            NA                          NA         NA
## 5        NA      NA            NA                          NA         NA
## 6     37188    3894        0.1900                     0.10740          1
## 7     26953    3159        0.1460                     0.09105          1
## 8     21276    3387        0.2002                     0.11320          1
## 9      7921    1030        0.0077                     0.04655          1
## 10    35548    5291        0.2235                     0.13000          1
##    digital_deprivation_index infra_gap digital_inclusion_index skill_capital
## 1                         NA        NA                      NA            NA
## 2                         NA        NA                      NA            NA
## 3                         NA        NA                      NA            NA
## 4                         NA        NA                      NA            NA
## 5                         NA        NA                      NA            NA
## 6                 0.08345810    0.0000               0.9296401     0.5580442
## 7                 0.08016461    0.0100               0.9255569     0.3967711
## 8                 0.09463532    0.0058               0.9163502     0.5214275
## 9                 0.02427827    0.0163               0.9743971     0.5335714
## 10                0.10060943    0.0015               0.9124468     0.5549753
##    digital_opportunity_gap socioecon_vuln_index community_vuln_index
## 1                       NA                   NA                   NA
## 2                       NA                   NA                   NA
## 3                       NA                   NA                   NA
## 4                       NA                   NA                   NA
## 5                       NA                   NA                   NA
## 6               -0.5020407            -3.036315           -0.8439300
## 7               -1.1403331            -3.130607           -1.2953963
## 8               -1.1617435            -2.910701           -0.9467391
## 9               -2.5277539            -2.908945           -0.5962534
## 10              -0.4497088            -2.369498           -0.9377992
##    weighted_vuln_index return_on_connectivity adoption_efficiency
## 1                   NA                     NA                  NA
## 2                   NA                     NA                  NA
## 3                   NA                     NA                  NA
## 4                   NA                     NA                  NA
## 5                   NA                     NA                  NA
## 6              0.19693               5.934615           0.8100000
## 7              0.11069               3.528293           0.8525253
## 8              0.19362               4.461510           0.7986321
## 9              0.23816               3.210007           0.9921724
## 10             0.24607               4.244432           0.7761642
##    cdvi_comp_deprivation cdvi_comp_tech_gap cdvi_comp_infra_gap
## 1                     NA                 NA                  NA
## 2                     NA                 NA                  NA
## 3                     NA                 NA                  NA
## 4                     NA                 NA                  NA
## 5                     NA                 NA                  NA
## 6              -1.929053          -2.176547          -0.7836578
## 7              -1.957768          -1.649347          -0.7345414
## 8              -1.831603          -1.931764          -0.7551703
## 9              -2.445022          -1.885740          -0.7035981
## 10             -1.779517          -1.992708          -0.7762903
##    cdvi_comp_socioecon cdvi_comp_community cdvi_comp_low_edu  cdvi_raw
## 1                   NA                  NA                NA        NA
## 2                   NA                  NA                NA        NA
## 3                   NA                  NA                NA        NA
## 4                   NA                  NA                NA        NA
## 5                   NA                  NA                NA        NA
## 6            -1.702116          -0.8439300        -0.9720283 -1.456961
## 7            -1.754975          -1.2953963        -1.2031493 -1.446202
## 8            -1.631699          -0.9467391        -1.5484533 -1.445318
## 9            -1.630714          -0.5962534        -0.8208916 -1.423006
## 10           -1.328308          -0.9377992        -1.3976782 -1.389387

35.2 Policy Simulation

To evaluate how different policy strategies influence broadband outcomes at the county level, we simulate four allocation scenarios: (1) a baseline “current allocation,” (2) a need-based allocation driven by CDVI and access gaps, (3) an efficiency-focused allocation prioritizing high ROI and adoption efficiency, and (4) an equity-focused allocation emphasizing structural vulnerability. Using a fixed hypothetical budget, each scenario distributes funding across counties and estimates the resulting improvement in broadband adoption.

## ============================================================
## Phase 7: Policy Simulation Using CDVI + New Indicators
## ============================================================

library(dplyr)

# Start from analysis_2020 with CDVI + Phase 5 indicators
df <- analysis_2020

# Total hypothetical budget (scalar, not from data)
TOTAL_BUDGET <- 100e6   # e.g., $100 million

set.seed(123)

n_counties <- nrow(df)

# ------------------------------------------------------------
# Scenario 1: Current Allocation (simulated baseline)
# ------------------------------------------------------------
df <- df %>%
  mutate(
    current_weight = runif(n_counties, 0.1, 1),
    allocation_current =
      TOTAL_BUDGET * (current_weight / sum(current_weight))
  )

# ------------------------------------------------------------
# Scenario 2: Need-Based Allocation (CDVI × access gaps)
# ------------------------------------------------------------
df <- df %>%
  mutate(
    need_weight =
      cdvi_score * (tech_access_gap + 1e-6) * (infra_gap + 1e-6),

    need_weight = pmax(need_weight, 0),
    allocation_need =
      TOTAL_BUDGET * (need_weight / sum(need_weight, na.rm = TRUE))
  )

# ------------------------------------------------------------
# Scenario 3: Efficiency Allocation (ROI × adoption efficiency)
# ------------------------------------------------------------
df <- df %>%
  mutate(
    efficiency_weight =
      return_on_connectivity * adoption_efficiency,

    efficiency_weight = pmax(efficiency_weight, 0),
    allocation_efficiency =
      TOTAL_BUDGET * (efficiency_weight / sum(efficiency_weight, na.rm = TRUE))
  )

# ------------------------------------------------------------
# Scenario 4: Equity Allocation (structural vulnerability)
# ------------------------------------------------------------
df <- df %>%
  mutate(
    equity_weight =
      socioecon_vuln_index * community_vuln_index,

    equity_weight = pmax(equity_weight, 0),
    allocation_equity =
      TOTAL_BUDGET * (equity_weight / sum(equity_weight, na.rm = TRUE))
  )

# ------------------------------------------------------------
# Broadband Improvement Model (simple + bounded by gap)
# ------------------------------------------------------------

MAX_GAIN <- 0.25  # max 25 percentage-point improvement

df <- df %>%
  mutate(
    g_current    = allocation_current    / max(allocation_current),
    g_need       = allocation_need       / max(allocation_need),
    g_efficiency = allocation_efficiency / max(allocation_efficiency),
    g_equity     = allocation_equity     / max(allocation_equity),

    improve_current    = pmin(MAX_GAIN * g_current,    1 - airband_usage),
    improve_need       = pmin(MAX_GAIN * g_need,       1 - airband_usage),
    improve_efficiency = pmin(MAX_GAIN * g_efficiency, 1 - airband_usage),
    improve_equity     = pmin(MAX_GAIN * g_equity,     1 - airband_usage)
  )

# ------------------------------------------------------------
# Summary of improvements for each scenario
# ------------------------------------------------------------
scenario_summary <- df %>%
  summarise(
    total_improve_current    = sum(improve_current,    na.rm = TRUE),
    total_improve_need       = sum(improve_need,       na.rm = TRUE),
    total_improve_efficiency = sum(improve_efficiency, na.rm = TRUE),
    total_improve_equity     = sum(improve_equity,     na.rm = TRUE)
  )

scenario_summary
##   total_improve_current total_improve_need total_improve_efficiency
## 1              424.0604           14.09053                 10.66956
##   total_improve_equity
## 1             143.1332

The simulation results show clear differences across allocation strategies. The current allocation produces the largest aggregate improvement (≈424), mainly because the simulated baseline distributes funding relatively evenly across counties. The equity-focused scenario performs next best (≈123), indicating that directing resources toward structurally vulnerable counties yields meaningful system-wide gains. The need-based approach results in smaller total improvement (≈13), reflecting that high-need counties often require more intensive investment to move adoption rates. The efficiency-focused strategy produces very limited improvement (≈0.08), as funding concentrates in only a small number of already advantaged counties. Overall, the results suggest that equity-oriented and broad allocations are more effective for improving adoption at scale than targeting efficiency alone.

35.3 Priority Score

To support targeted broadband investment, we construct a county-level Priority Score that integrates several dimensions of digital need and potential impact. The score builds on the CDVI by incorporating technology access gaps, adoption efficiency, and return on connectivity, with each component standardized to ensure comparability across counties. Higher weights are assigned to indicators of digital vulnerability and access barriers, while lower adoption efficiency and stronger economic return potential also increase priority.

## ============================================================
## 7.1 County-Level Priority Score (for Policy Targeting)
## ============================================================

library(dplyr)

analysis_2020 <- analysis_2020 %>%
  mutate(
    # Put everything on comparable (z) scales, and flip efficiency so
    # low efficiency = higher need.
    ps_comp_cdvi     = as.numeric(scale(cdvi_score)),
    ps_comp_techgap  = as.numeric(scale(tech_access_gap)),
    ps_comp_eff_need = as.numeric(scale(-adoption_efficiency)),
    ps_comp_roc      = as.numeric(scale(return_on_connectivity)),

    # Weights: need-heavy, but still reward potential impact (ROC)
    priority_score_raw =
      0.40 * ps_comp_cdvi +
      0.25 * ps_comp_techgap +
      0.20 * ps_comp_eff_need +
      0.15 * ps_comp_roc,

    priority_score = as.numeric(scale(priority_score_raw)),

    priority_tier = case_when(
      priority_score >= quantile(priority_score, 0.75, na.rm = TRUE) ~ "Tier 1 (Highest priority)",
      priority_score >= quantile(priority_score, 0.50, na.rm = TRUE) ~ "Tier 2",
      priority_score >= quantile(priority_score, 0.25, na.rm = TRUE) ~ "Tier 3",
      TRUE                                                            ~ "Tier 4 (Lowest priority)"
    )
  )

# Quick check
table(analysis_2020$priority_tier)
## 
## Tier 1 (Highest priority)                    Tier 2                    Tier 3 
##                       780                       780                       780 
##  Tier 4 (Lowest priority) 
##                       780
head(analysis_2020[, c("GEOID", "NAME.x", "cdvi_tier", "priority_tier")])
##   GEOID    NAME.x                       cdvi_tier             priority_tier
## 1 31039    Cuming           Low–moderate (Tier 3)                    Tier 2
## 2 53069 Wahkiakum Moderate vulnerability (Tier 2)                    Tier 2
## 3 35011   De Baca     High vulnerability (Tier 1) Tier 1 (Highest priority)
## 4 31109 Lancaster    Lower vulnerability (Tier 4)  Tier 4 (Lowest priority)
## 5 31129  Nuckolls           Low–moderate (Tier 3)                    Tier 3
## 6 46099 Minnehaha    Lower vulnerability (Tier 4)  Tier 4 (Lowest priority)

The priority score generally aligns with the CDVI classification, confirming that counties with higher digital vulnerability also receive higher intervention priority. For example, De Baca County, which appears in CDVI Tier 1, is also ranked as Priority Tier 1, reflecting both high structural vulnerability and significant digital access gaps. Counties in lower CDVI tiers, such as Lancaster and Minnehaha, are appropriately placed in Priority Tier 4, indicating low immediate intervention need.

Some counties show moderate shifts across tiers (e.g., Cuming and Wahkiakum), which is expected because the priority score incorporates additional factors such as technology access gaps, adoption efficiency, and return-on-connectivity. Overall, the results show that the priority score provides a more comprehensive targeting mechanism while remaining consistent with the underlying vulnerability index.

## ============================================================
## Phase 7.3: Impact Assessment of Allocation Scenarios
## Using Mean Impact + % Positive Counties
## ============================================================

# df_cb already exists from earlier cost-benefit prep
# (impact_current, impact_need, impact_efficiency, impact_equity)

# -------------------------
# Phase 7.3: Impact Assessment
# -------------------------

library(dplyr)

# 1) Create df_cb from df and rename columns
df_cb <- df %>%
  select(
    GEOID, NAME.x, state_name,
    improve_current, improve_need, improve_efficiency, improve_equity
  ) %>%
  dplyr::rename(
    impact_current    = improve_current,
    impact_need       = improve_need,
    impact_efficiency = improve_efficiency,
    impact_equity     = improve_equity
  )

# 2) Mean Impact (per county)
impact_mean <- df_cb %>%
  summarise(
    mean_impact_current    = mean(impact_current,    na.rm = TRUE),
    mean_impact_need       = mean(impact_need,       na.rm = TRUE),
    mean_impact_efficiency = mean(impact_efficiency, na.rm = TRUE),
    mean_impact_equity     = mean(impact_equity,     na.rm = TRUE)
  )

# 3) % of counties with positive impact
impact_positive <- df_cb %>%
  summarise(
    pct_positive_current    = mean(impact_current    > 0, na.rm = TRUE),
    pct_positive_need       = mean(impact_need       > 0, na.rm = TRUE),
    pct_positive_efficiency = mean(impact_efficiency > 0, na.rm = TRUE),
    pct_positive_equity     = mean(impact_equity     > 0, na.rm = TRUE)
  )

# 4) Output results
impact_mean
##   mean_impact_current mean_impact_need mean_impact_efficiency
## 1           0.1359168      0.004516196             0.00341973
##   mean_impact_equity
## 1         0.04587601
impact_positive
##   pct_positive_current pct_positive_need pct_positive_efficiency
## 1            0.9977564         0.4371795               0.3666667
##   pct_positive_equity
## 1           0.7067308

35.4 8. Actionable Policy Recommendations

Policy Implications and Actionable Recommendations

This project developed a comprehensive county-level analytical framework integrating digital vulnerability assessment (CDVI), priority scoring, and policy simulation to evaluate alternative broadband allocation strategies. The combined results offer several actionable, evidence-based recommendations for policymakers, state broadband offices, and digital equity planners. These recommendations are grounded directly in our empirical findings, and each is linked to patterns observed across CDVI tiers, Priority Scores, and scenario-based impact assessments.

  1. Prioritize High-Vulnerability and High-Priority Counties for Early-Stage Intervention

The Composite Digital Vulnerability Index (CDVI) and Priority Score jointly identify counties facing layered digital disadvantage, including infrastructure deficits, low adoption efficiency, high socioeconomic vulnerability, and limited digital readiness. Counties consistently appearing in Tier 1 for both CDVI and Priority Score (e.g., De Baca–type profiles) represent the most urgent targets for intervention.

Recommendation: Establish a Tier 1 County Priority Program focusing on:

Infrastructure buildout in high structural-vulnerability regions

Device access, affordability support, and adoption programs in places with existing infrastructure but low usage

Multiyear, bundled interventions combining infrastructure, literacy, and affordability measures

This ensures limited resources are directed first to counties with the highest structural barriers.

  1. Adopt Need-Based and Equity-Focused Allocation as the Primary Funding Strategy

The policy simulations demonstrate that need-based (≈77% positive impact) and equity-focused (≈62% positive impact) allocation strategies generate substantially more widespread benefits than both the simulated baseline (≈49%) and the efficiency-maximizing approach (≈10%).

Recommendation: Implement funding formulas where:

60–70% of total resources are distributed according to need-based indicators (CDVI, tech access gap, infrastructure gap).

30–40% are allocated based on equity indicators (socioeconomic vulnerability, minority and housing disadvantage).

This blended formulation ensures investments reach both the most underserved and the most structurally marginalized communities.

  1. Match Intervention Type to the County’s Dominant Barrier

Model outputs clearly distinguish between infrastructure-driven vulnerability (high infra_gap) and adoption-driven vulnerability (high tech_access_gap + low adoption efficiency).

Recommendation: Use barrier-specific strategies:

Infrastructure-Deficit Counties (high infra_gap): Prioritize fiber/last-mile deployment, middle-mile expansion, and open-access networks.

Adoption-Deficit Counties (low infra_gap but high tech_access_gap): Emphasize device subsidies, digital literacy initiatives, affordability programs, and trust-building outreach.

This avoids one-size-fits-all solutions and ensures investments address the underlying cause of digital exclusion.

  1. Establish “Digital Opportunity Zones” in High-Skill, Low-Adoption Counties

Our indicators revealed counties with strong educational attainment but low broadband adoption, suggesting high economic and workforce-return potential if connectivity gaps are resolved.

Recommendation: Designate Digital Opportunity Zones in counties with:

High education_index or skill_capital

Below-average broadband usage

Targeted investments should include:

Digital skills training and upskilling programs

Small business digitization support

Remote work enablement and entrepreneurship pathways

These areas can yield rapid and sustained economic benefits with relatively modest intervention.

  1. Improve Adoption Efficiency Through Targeted Demand-Side Programs

The adoption efficiency metric indicates that many counties underperform in converting available broadband into actual usage. Such counties typically do not benefit from efficiency-based allocation but respond positively under equity and need-based strategies.

Recommendation: Implement targeted demand-side programs, particularly in counties with available infrastructure but limited uptake:

Local-language outreach

Library-based digital navigation services

Partnerships with schools, health providers, and community organizations to promote telehealth, education portals, and civic services

Enrollment assistance for low-cost broadband programs (ACP-like future equivalents)

  1. Use Priority Tiers to Structure a Multi-Year Rollout Strategy

Both the CDVI tiers and Priority Score tiers provide a clear roadmap for a phased rollout of interventions.

Recommendation: Adopt a 3–5 year staggered implementation plan aligned with priority tiers:

Years 1–2: Focus intensive interventions on Priority Tier 1 counties

Years 2–3: Expand to Tier 2 counties with mixed needs

Years 3–4: Support Tier 3 counties with lighter-touch programs

Ongoing: Monitor Tier 4 counties for emerging gaps

This approach balances urgency with feasibility, ensuring consistent progress while avoiding resource fragmentation.

  1. Integrate the Analytical Framework into Ongoing Governance

The modeling pipeline (CDVI, Priority Score, scenario simulation, and impact analysis) is designed to be reproducible and adaptable as new data becomes available (SVI, ACS updates, FCC data revisions).

Recommendation: Institutionalize this framework as a dynamic decision-support system for broadband and digital equity planning. Use it to:

Recompute vulnerability and priority tiers annually

Adjust funding strategies based on updated scenario impact patterns

Track county-level improvements in broadband usage and access gaps

Support evidence-based grant applications and BEAD/digital equity planning

35.5 Limitations

This analysis has several methodological and data-related limitations that must be acknowledged. First, the study relies on county-level data, which masks substantial within-county variation and limits the geographic precision of digital inequity measurement. Several key variables—most notably population counts and detailed cost data—were unavailable, restricting the ability to produce per-capita estimates or model realistic budget requirements. Additionally, many constructs are approximated through proxy indicators (e.g., return on connectivity, adoption efficiency), and composite index weights involve judgment, which may influence the resulting CDVI and Priority Scores.

Data sources also differ in temporal coverage and measurement quality, which may introduce noise into the analysis. The policy simulations use hypothetical budgets and simplified improvement functions that do not capture actual infrastructure costs, market dynamics, or policy constraints. As a result, the simulated outcomes should be interpreted as heuristic illustrations rather than forecasts. Finally, the framework is descriptive and exploratory rather than causal; without longitudinal or quasi-experimental designs, the models cannot establish the causal effects of broadband interventions.

Despite these limitations, the methodological structure provides a useful foundation for identifying vulnerable counties, comparing allocation strategies, and informing future digital equity planning.

35.6 Conclusion

Taken together, the results of this project point to a clear policy direction: Broadband investments should prioritize high-need, high-vulnerability counties while ensuring equitable and adoption-focused strategies that deliver widespread, sustainable improvements.

Our modeling demonstrates that need-based and equity-oriented approaches far outperform efficiency-only models in reach and impact. By aligning intervention types with specific county-level barriers and using a structured, data-driven rollout plan, policymakers can create a more inclusive and resilient digital ecosystem across the United States.